throbber
Retrieval C.A. Montgomery
`Information
`Editor
`Processing
`and Language
`
`G.Salton, A. Wong
`
`and C. S. Yang
`
`Cornell University
`
`Space Model
`A Vector
`Indexing
`for Automatic
`
`
`
`1.Document Space Configurations
`
`Consider a document space consisting of documents
`
`
`
`
`Di , each identified by one or more index terms Ti;
`
`
`the terms may be weighted according to their im­
`
`
`
`
`portance, or unweighted with weights restricted to 0
`
`
`index space is and 1.1 A typical three-dimensional
`
`shown in Figure 1, where each item is identified by up to
`
`
`three distinct terms. The three-dimensional example
`
`
`may be extended to t dimensions when t different
`
`index terms are present. In that case, each document
`Di is represented
`
`by a t-dimensional vector
`
`CR Categories: 3.71, 3.73, 3.74, 3.75
`
`d;j representing the weight of the jth term.
`
`matching
`or other pattern
`retrieval,
`In a document
`
`
`Given the index vectors for two documents, it is
`(documents)
`are
`entities
`where stored
`environment
`
`
`
`possible to compute a similarity coefficient between
`patterns
`with each other or with incoming
`compared
`
`them, s(Di, D;), which reflects the degree of similarity
`best indexing
`that the
`it appears
`requests),
`(search
`
`
`in the corresponding terms and term weights. Such a
`lies as far away
`space is one where each entity
`(property)
`
`
`
`similanty measure might be the inner product of the
`in these circumstances
`the
`as possible;
`from the others
`
`
`
`
`two vectors, or alternatively an inverse function of the
`system may be expressible
`as a
`value of an indexing
`
`
`
`angle between the corresponding vector pairs; when the
`in particular,
`space;
`of the object
`of the density
`function
`
`
`
`term assignment for two vectors is identical, the angle
`with space
`inversely
`may correlate
`performance
`retrieval
`
`
`will be zero, producing a maximum similarity measure.
`computations
`based on space density
`An approach
`density.
`
`
`
`Instead of identifying each document by a complete
`vocabulary
`for a
`indexing
`an optimum
`is used to choose
`
`
`
`
`vector originating at the 0-point in the coordinate sys­
`
`
`
`
`
`collection of documents. Typical evaluation results are
`
`
`
`
`tem, the relative distance between the vectors is pre­
`
`
`shown, demonstating the usefulness of the model.
`
`
`served by normalizing all vector lengths to one, and
`
`
`Key Words and Phrases: automatic information
`
`
`
`considering the projection of the vectors onto the en­
`
`
`
`
`
`retrieval, automatic indexing, content analysis, document
`
`
`velope of the space represented by the unit sphere. In
`space
`
`
`that case, each document may be depicted by a single
`
`
`
`point whose position is specified by the area where the
`
`
`
`corresponding document vector touches the envelope
`
`
`
`of the space. Two documents with similar index terms
`are then represented by points that are very close to­
`
`
`
`gether in the space, and, in general, the distance be­
`
`tween two document points in the space is inversely
`
`
`
`
`Copyright© 1975, Association for Computing Machinery, Inc.
`
`General permission to republish, but not for profit, all or part
`
`
`
`
`
`
`correlated with the similarity between the correspond­
`
`
`
`
`of this material is granted provided that ACM's copyright notice
`ing vectors.
`
`
`is given and that reference is made to the publication, to its date
`Since the configuration of the document space is a
`
`
`
`
`
`of issue, and to the fact that reprinting privileges were granted
`
`
`
`by permission of the Association for Computing Machinery.
`
`
`function of the manner in which terms and term weights
`
`
`This study was supported in part by the National Science
`
`
`
`are assigned to the various documents of a collection,
`
`
`
`
`Foundation under grant GN 43505. Authors' addresses: G. Salton
`
`one may ask whether an optimum document space
`
`
`
`
`and A. Wong, Department of Computer Science, Cornell Univer­
`
`
`
`
`
`of Computer Sci­sity, Ithaca, NY 14850; C. S. Yang, Department
`
`
`
`configuration exists, that is, one which produces an
`
`ence, The University of Iowa, Iowa City, IA, 52240.
`
`optimum retrieval performance.
`
`
`
`1 Although we speak of documents and index terms, the present
`
`
`If nothing special is known about the documents
`
`
`
`
`development applies to any set of entities identified by weighted
`
`property vectors.
`
`
`under consideration, one might conjecture that an
`
`
`
`
`2 Retrieval performance is often measured by parameters such
`
`
`
`ideal document space is one where documents that are
`
`
`
`items actually as recall and precision, reflecting the ratio of relevant
`
`
`
`
`jointly relevant to certain user queries are clustered
`
`
`
`
`retrieved and of retrieved items actually relevant. The question
`
`
`concerning optimum space configurations may then be more
`
`
`together, thus insuring that they would be retrievable
`
`
`
`conventionally expressed in terms of the relationship between
`
`
`
`
`jointly in response to the corresponding queries. Con­
`
`
`
`document indexing, on the one hand, and retrievctl performance,
`
`
`trariwise, documents that are never wanted simul-
`on the other.
`
`2
`
`613
`
`Communications
`November 1975
`
`of
`Volume 18
`Number 11
`the ACM
`
`IPR2019-01304
`BloomReach, Inc. EX1020 Page 1
`
`

`

`taneously would appear well separated in the docu- ment space. Such a situation is depicted in Figure 2, where the distance between two x's representing two documents is inversely related to the similarity between the corresponding index vectors. While the document configuration of Figure 2 may indeed represent the best possible situation, assuming that relevant and nonrelevant items with respect to the various queries are separable as shown, no practical way exists for actually producing such a space, because during the indexing process, it is difficult to anticipate what relevance assessments the user population will provide over the course of time. That is, the optimum configuration is difficult to generate in the absence of a priori knowledge of the complete retrieval history for the given collection. In these circumstances, one might conjecture that the next best thing is to achieve a maximum possible separation between the individual documents in the space, as shown in the example of Figure 3. Specifically, for a collection of n documents, one would want to minimize the function r = ~ ~ s(Di, D~), (1) i=l j=l where s(Di, Dr) is the similarity between documents i andj. Obviously when the function of eq. (1) is mini- mized, the average similarity between document pairs is smallest, thus guaranteeing that each given document may be retrieved when located sufficiently close to a user query without also necessarily retrieving its neigh- bors. This insures a high precision search output, since a given relevant item is then retrievable without also retrieving a number of nonrelevant items in its vicinity. In cases where several different relevant items for a given query are located in the same general area of the space, it may then also be possible to retrieve many of the relevant items while rejecting most of the nonrele- vant. This produces both high recall and high precision? Two questions then arise: first, is it in fact the case that a separated document space leads to a good re- trieval performance, and vice-versa that improved re- trieval performance implies a wider separation of the documents in the space; second, is there a practical way of measuring the space separation. In practice, the ex- pression of eq. (1) is difficult to compute, since the num- ber of vector comparisons is proportional to n 2 for a collection of n documents. For this reason, a clustered document space is best considered, where the documents are grouped into classes, each class being represented by a class centroid. 3 In practice, the best performance is achieved by obtaining for each user a desired recall level (a specified proportion of the relevant items); at that recall level, one then wants to maximize precision by retrieving as few of the nonrelevant items as possible. 4 A number of well-known clustering methods exist for auto- matically generating a clustered collection from the term vectors representing the individual documents [l]. 614 Fig. 1. Vector representation of document space. Y I t D 3 = (T I ", TZ" T3" ) L / / / / / / / T 3 D I = (T I , T 2 , T 3 ) .... ~ T 2 - (T I',TIt',T)') Fig. 2. Ideal document space. / r-~\ [ \ ~ ] Groups of Relevant-Items \ / X Individual Documents Fig. 3. Space with maximum separation between document pairs. Xx x X X X X X X X Individual Document Communications November 1975 of Volume 18 the ACM Number 11
`
`IPR2019-01304
`BloomReach, Inc. EX1020 Page 2
`
`

`

`Fig. 4. Clustered document space. • Cluster Centroid m Main centroid of clustering represents most closely the separated space shown for the unclustered case in Figure 3. If one assumes that documents that are closely related within a single cluster normally exhibit identical relevance characteristics with respect to most user queries, then the best retrieval performance should be obtainable with a clustered space exhibiting tight individual clusters, but large intercluster distances; that is, (a) the average similarity between pairs of documents within a single cluster should be maximized, while simultaneously (b) the average similarity between different cluster centroids is minimized. The reverse obtains for cluster organiza- tions not conducive to good performance where the individual clusters should be loosely defined, whereas the distance between different cluster cefitroids should be small. In the remainder of this study, actual performance figures are given relating document space density to retrieval performance, and conclusions are reached regarding good models for automatic indexing. A typical clustered document space is shown in Figure 4, where the various document groups are represented by circles and the centroids by black dots located more or less at the center of the respective clusters. 4 For a given document class K comprising m documents, each ele- ment of the centroid C may then be defined as the average weight of the same elements in the correspond- ing document vectors, that is, cj = (l/m) ~ dij. (2) i=1 DiCK Corresponding to the centroid of each individual document cluster, a centroid may be defined for the whole document space. This main centroid, repre- sented by a small rectangle in the center of Figure 4, may then be obtained from the individual cluster centroids in the same manner as the cluster centroids are computed from the individual documents. That is, the main centroid of the complete space is simply the weighted average of the various cluster centroids. In a clustered document space, the space density measure consisting of the sum of all pairwise document similarities, introduced earlier as eq. (1), may be re- placed by the sum of all similarity coefficients between each document and the main centroid; that is Q = ~ s(C*, Di), (3) i=1 where C* denotes the main centroid. Whereas the computation of eq. (1) requires n 2 operations, an evalu- ation of eq. (3) is proportional to n whenever s(Di, Dr) is proportional to the inner product of the corresponding vectors. Given a clustered document space such as the one shown in Figure 4, it is necessary to decide what type 615 2. Correlation between Indexing Performance and Space Density The main techniques useful for the evaluation of automatic indexing methods are now well understood. In general, a simple straightforward process can be used as a baseline criterion; for example, the use of certain word stems extracted from documents or docu- ment abstracts, weighted in accordance with the fre- quency of occurrence (fi k) of each term k in document i. This method is known as term-frequency weighting. Recall-precision graphs can be used to compare the performance of this standard process against the output produced by more refined indexing methods. Typically, a recall-precision graph is a plot giving precision figures, averaged over a number of user queries, at ten fixed recall levels, ranging from 0.1 to 1.0 in steps of 0.1. The better indexing method will of course produce higher precision figures at equivalent recall levels. One of the best automatic term weighting procedures evaluated as part of a recent study consisted of multi- plying the standard term frequency weight j5 k by a factor inversely related to the document frequency dk of the term (the number of documents in the collection to which the term is assigned). [2] Specifically, if dk is the document frequency of term k, the inverse document frequency IDFk of term k may be defined as [3]: (IDF)k = flog2 n 7 -- flog2 dk 7 + 1. A term weighting system proportional to (f~k.IDF~) will assign the largest weight to those terms which arise with high frequency in individual documents, but are at the same time relatively rare in the collection as a whole. It was found in the earlier study that the average improvement in recall and precision (average precision Communications November 1975 of Volume 18 the ACM Number 11
`
`IPR2019-01304
`BloomReach, Inc. EX1020 Page 3
`
`

`

`improvement at the ten fixed recall points) was about 14 percent for the system using inverse document fre- quencies over the standard term frequency weighting. The corresponding space density measurements are shown in Table I(a) using two different cluster organi- zations for a collection of 424 documents in aerody- namics: (i) Cluster organization A is based on a large number of relatively small clusters, and a considerable amount of overlap between the clusters (each document appears in about two clusters on the average); the clusters are defined from the docu- ment-query relevance assessments, by placing into a common class all documents jointly declared relevant to a given user query. (ii) Cluster organization B exhibits fewer classes (83 versus 155) of somewhat larger size (6.6 documents per class on the average versus 5.8 for cluster organization A); there is also much less overlap among the clusters (1.3 clusters per document versus 2.1). The classes are constructed by using a fast automatic tree-search algorithm due to Williamson. [4] A number of space density measures are shown in Table I(a) for the two cluster organizations, including the average similarity between the documents and the corresponding cluster centroids (factor x); the average similarity between the cluster centroids and the main centroid; and the average similarity between pairs of cluster centroids (factor y). Since a well-separated Table I. Effect of Performance Change on Space Density space corresponds to tight clusters (large x) and large differences between different clusters (small y), the ratio y/x can be used to measure the overall space density [5]. It may be seen from Table l(a) that all density measures are smaller for the indexing system based on inverse document frequencies; that is, the documents within individual clusters resemble each other less, and so do the complete clusters themselves. However, the "spreading out" of the clusters is greater than the spread of the documents inside each cluster. This ac- counts for the overall decrease in space density between the two indexing systems. The results of Table I(a) would seem to support the notion that improved recall- precision performance is associated with decreased density in the document space. The reverse proposition, that is, whether decreased performance implies increased space density, may be tested by carrying out term weighting operations inverse to the ones previously used. Specifically, since a weight- ing system in inverse document frequency order pro- duces a high recall-precision performance, a system which weights the terms directly in order of their docu- ment frequencies (terms occurring in a large number of documents receive the highest weights) should be cor- respondingly poor. In the output of Table l(b), a term weighting system proportional to (f,k. DFz.) is used, where f,: is again the term frequency of term k in document i, and DFk is defined as IO/(IDF)A.. The recall-precision figures of Table l(b) show that such a (a) Effect of performance improvement on space density Cluster organization A Cluster organization B (b) Effect of performance deterioration on space density Type of indexing Cluster organization A Cluster organization B (155 clusters; (83 clusters; (155 clusters; (83 clusters; 2.1 overlap) 1.3 overlap) 2.1 overlap) 1.3 overlap) Term Term Standard frequency Standard frequency term with term with frequency inverse frequency inverse weights doc. freq. weights doc. freq. (fi k) (fl k. IDFk) (fl k) (fi k" IDFk) Term Term Standard frequency Standard frequency term with term with frequency document frequency document weights frequency weights frequency (fi k) (fik'DFk) (fi k) (flk'DFk) Recall-precision output* -- +14% -- +14% -- - 10.1% -- - 10.1% Average similarity between documents and correspond- .712 .668 .650 .589 .712 .741 .650 .696 ing cluster centeroids (x) (- .044) (- .061 ) (+ .029) (+ .046) Average similarity between cluster centroids and .500 .454 .537 .492 .500 .555 .537 .574 main centroid (--.046) (--.045) (+ .055) (+ .037) Average similarity between pairs of cluster .273 .209 .315 .252 .273 .329 .315 .362 centroids (y) (--.046) (--.063) (+.056) (+.047) Ratio y/x .273/.712 .209/.668 .315/.650 .252/,.589 .273/.712 .329/.741 .315/.650 .362/.696 = .383 = .318 = .485 = .428 = .383 =.44z) = .485 = .520 (--19%) (--12%) (+16~o) (+7%) * From [2]. 616 Communications of the ACM November 1975 Volume 18 Number 11
`
`IPR2019-01304
`BloomReach, Inc. EX1020 Page 4
`
`

`

`weighting system produces a decreased performance of about ten percent, compared with the standard. The space density measurements included in Table I(b) are the same as those in Table I(a). For the in- dexing system of Table I(b), a general "bunching up" of the space is noticeable, both inside the clusters and between clusters. However, the similarity of the various cluster centroids increases more than that between documents inside the clusters. This accounts for the higher y/x factor by 16 and 7 percent for the two cluster organizations, respectively. 3. Correlation Between Space Density and Indexing Performance In the previous section certain indexing methods which operate effectively in a retrieval environment were seen to be associated with a decreased density of the vectors in the document space, and contrariwise, poor retrieval performance corresponded to a space that is more compressed. The relation between space configuration and re- trieval performance may, however, also be considered from the opposite viewpoint. Instead of picking docu- ment analysis and indexing systems with known per- formance characteristics and testing their effect on the density of the document space, it is possible to change the document space configurations artificially in order to ascertain whether the expected changes in recall and precision are in fact produced. The space density criteria previously given stated that a collection of small tightly clustered documents with wide separation between individual clusters should produce the best performance. The reverse is true of large nonhomogeneous clusters that are not well sepa- rated. To achieve improvements in performance, it would then seem to be sufficient to increase the simi- larity between document vectors located in the same cluster, while decreasing the similarity between different clusters or cluster centroids. The first effect is achieved by emphasizing the terms that are unique to only a few clusters, or terms whose cluster occurrence fre- quencies are highly skewed (that is, they occur with large occurrence frequencies in some clusters, and with much lower frequencies in many others). The second result is produced by deemphasizing terms that occur in many different clusters. Two parameters may be introduced to be used in carrying out the required transformations [5]: NC(k) : the number of clusters in which term k occurs (a term occurs in a cluster if it is assigned to at least one document in that cluster) ; and CF(k,j): the cluster frequency of term k in cluster j that is, the number of documents in cluster j in which term k occurs. For a collection arranged into p clusters, the average cluster frequency (CF(k)) may then be defined from CF(k, j) as (CF(k)) = (l/p) k CF(k,j). j=l Given the above parameters, the skewness of the occurrence frequencies of the terms may now be mea- sured by a factor such as Fa = I(CF(k)) -- CF(k,j) I. On the other hand, a factor Fz inverse to NC(k) [for example, 1/NC(k)] can be used to reflect the rarity with which term k is assigned to the various clusters. By multiplying the weight of each term k in each cluster j by a factor proportional to F1.F.2 a suitable spreading out should be obtained in the document space. Contrari- wise, the space will be compressed when a multiplicative factor proportional to 1/(FI.F2) is used. The output of Table II(a) shows that a modifica- tion of term weights by the F~.F.., factor produces pre- cisely the anticipated effect: the similarity between documents included in the same cluster (factor x) is now greater, whereas the similarity between different cluster centroids (factor y) has decreased. Overall, the space density measure (y/x) decreases by 18 and 11 percent respectively for the two cluster organizations. The average retrieval performance for the spread-out space shown at the bottom of Table II(a) is improved by a few percentage points. The corresponding results for the compression of the space using a transformation factor of 1/(Fx.F.2) are shown in Table ll(b). Here the similarity between documents inside a cluster decreases, whereas the simi- larity between cluster centroids increases. The overall space density measure (y/x) increases by 11 and 16 percent for the two cluster organizations compared with the space representing the standard term frequency weighting. This dense document space produces losses in recall and precision performance of 12 to 13 percent. Taken together, the results of Tables I and II indi- cate that retrieval performance and document space density appear inversely related, in the sense that ef- fective indexing methods in terms of recall and pre- cision are associated with separated (compressed) document spaces; on the other hand, artificially generated alterations in the space densities appear to produce the anticipated changes in performance. The foregoing evidence thus confirms the usefulness of the "term discrimination" model and of the auto- matic indexing theory based on it. These questions are examined briefly in the remainder of this study. 4. The Discrimination Value Model For some years, a document indexing model known as the term discrimination model has been used experi- 617 Communications November 1975 of Volume 18 the ACM Number 11
`
`IPR2019-01304
`BloomReach, Inc. EX1020 Page 5
`
`

`

`Table II. Effect of Cluster Density on Performance (a) Effect of low cluster density (b) Effect of high cluster density on performance on performance Cluster organization A Cluster organization B Cluster organization A Cluster organization B (155 clusters; (83 clusters; (155 clusters; (83 clusters; 2.1 overlap) 1.3 overlap) 2.1 overlap) 1.3 overlap) Standard Low Standard Low Standard High Standard High cluster cluster cluster cluster cluster cluster cluster cluster density density density density density density density density (emphasis (emphasis (emphasis (emphasis on low on low on high on high (term frequency (term frequency (term frequency (term frequency frequency and skewed frequency and skewed frequency and even frequency and even weights) terms) weights) terms) weights) terms) weights) terms) Average similarity between documents and their .712 centroids (x) Average similarity between cluster centroids and .500 main centroid Average similarity between .273 pairs of centroids (y) Ratio y/x .273/.712 =. 383 Recall-precision comparison .730 .650 .653 .712 .681 .650 .645 (+ .018) (+ .003) (-- .031) (-- .005) .477 .537 .528 .500 .523 .537 .531 (-- .023) (-- .009) (+ .023) (-- .006) .229 .315 .281 .273 .290 .315 .364 (-- .044) (-- .034) (+ .017) (+ .049) .229/.730 .315/.650 .281/.653 .273/.712 .290/.681 .315/.650 .364/.645 = .314 = .485 = .430 = .383 = .426 = .485 = .561 (--18%) (--11%) (+11%) (+16%) +2.6% -- +2.3% -- --12.4% -- --13.3% mentally. [2, 6] This model bases the value of an index term on its "discrimination value" DV, that is, on an index which measures the extent to which a given term is able to increase the differences among document vectors when assigned as an index term to a given col- lection of documents. A "good" index term, that is, one with a high discrimination value, decreases the similarity between documents when assigned to the collection, as shown in the example of Figure 5. The reverse obtains for the "bad" index term with a low discrimination value. To measure the discrimination value of a term, it is sufficient to take the difference in the space densities before and after assignment of the particular term. Specifically, let the density of the complete space be measured by a function Q such as that of eq. (3); that is, by the sum of the similarities between all docu- ments and the space centroid. The contribution of a Fig. 5. Operation of good discriminating term. Before Assignment of Term x After Assignment of Term X Document 0 Main Centroid 618 Table llI. Terms in Discrimination Value Order (1963 Time Magazine) Good terms Poor terms 1. Buddhist 7560. Work 2. Diem 7561. Lead 3. Lao 7562. Red 4. Arab 7563. Minister 5. Viet 7564. Nation 6. Kurd 7565. Party 7. Wilson 7566. Commune 8. Baath 7567. U.S. 9. Park 7568. Govern 10. Nenni 7569. New given term k to the space density may be ascertained by computing the function D V~. = Qk- Q, (4) where Qk is the compactness of the document space with term k deleted from all document vectors. If term k is a good discriminator, valuable for content identifi- fication, then Qk > Q, that is, the document space after removal of term k will be more compact (because upon assignment of that term to the documents of a collection the documents will resemble each other less and the space spreads out). Thus for good discrimina- tors Qk - Q > 0; the reverse obtains for poor discrimi- nators, for which Qk - Q < 0. Because of the manner in which the discrimination values are defined, it is clear that the good discriminators must be those with uneven occurrence frequency dis- tributions which cause the space to spread out when Communications November 1975 of Volume 18 the ACM Number 11
`
`IPR2019-01304
`BloomReach, Inc. EX1020 Page 6
`
`

`

`assigned by decreasing the similarity between the indi- vidual documents. The reverse is true for the bad dis- criminators. A typical list including the ten best terms and the ten worst terms in discrimination value order (in order by the Qk -- Q value) is shown in Table III for a collection of 425 articles in world affairs from Time magazine. A total of 7569 terms are used for this collection, exclusive of the common English function words that have been deleted. In order to translate the discrimination value model into a possible theory of indexing, it is necessary to examine the properties of good and bad discriminators Fig. 6. Average discrimination value rank of terms. Discrimination Rank Average Term 5000' 4000 Medlars Collection 450 Documents i 4726 Terms 3000 2000 I OOO / 500 I s 5 "t e it is 15-16 19-20Z3-2530"-3239-44~69-156 3845 675 terms 206 terms terms Document Frequency in greater detail. Figure 6 is a graph of the terms as- signed to a sample collection of 450 documents in medi- cine, presented in order by their document frequencies. For each class of terms--those of document frequency 1, document frequency 2, etc.--the average rank of the corresponding terms is given in discrimination value order (rank 1 is assigned to the best discriminator and rank 4726 to the worst term for the 4726 terms of the medical collection). Fig. 6 shows that terms of low document frequency, i.e. those that occur in only one, or two, or three docu- ments, have rather poor average discrimination ranks. The several thousand terms of document frequency 1 have an average rank exceeding 3000 out of 4726 in discrimination value order. The terms with very high document frequency, i.e. at least one term in the med- ical collection occurs in as many as 138 documents out of 450, are even worse discriminators; the terms with document frequency greater than 25 have average dis- crimination values in excess of 4000 in the medical col- lection. The best discriminators are those whose docu- ment frequency is neither too low nor too high. The situation relating document frequency to term discrimination value is summarized in Figure 7. The 4 percent of the terms with the highest document fre- quency, representing about 50 percent of the total term assignments to the documents of a collection, are the worst discriminators. The 70 percent of the terms with the lowest document frequency are generally poor dis- criminators. The best discriminators are the 25 percent whose document frequency lies approximately between 17/100 and tl/10 for n documents. If the model of Figure 7 is a correct representation of the situation relating to term importance, the fol- lowing indexing strategy results [6, 7]: (a) Terms with medium document frequency should be used for content identification directly, without further transformation. (b) Terms with very high document frequency should be moved to the left on the document frequency spectrum by transforming them into entities of lower frequency; the best way of doing this is by taking high- frequency terms and using them as components of indexing phrases--a phrase such as "programming language" will necessarily exhibit lower document Fig. 7. Summarization of discrimination value of terms in frequency ranges. left -to- Right Ri ght- to - Left Recoil Improving Precision Improving IP 4 Poor Worst Discriminators Best Discrimlnotors Discriminators [ \\\\\\,. I ///,,.,..////////// ]1 I I I I I I I I I i O nllO0 ollO hi2 70°1o of Terms 26% of Terms 4% of Terms (5,0% of Term assignments) Document ;~ Frequency (n Documents in oil) 619 Communications of the ACM November 1975 Volume 18 Number 11
`
`IPR2019-01304
`BloomReach, Inc. EX1020 Page 7
`
`

`

`frequency than either "program", or "language" alone. (c) Terms with very low document frequency should be moved to the right on the document frequency spectrum by being transformed into entities of higher frequency; one way of doing this is by collecting several low frequency terms that appear semantically similar and including them in a common term (thesaurus) class. Each thesaurus class necessarily exhibits a higher document frequency than any of the component mem- bers that it replaces. The indexing theory which consists in using certain elements extracted from document texts directly as index terms, combined with phrases made up of high frequency components and thesaurus classes defined from low frequency elements has been tested using document collections in aerodynamics (CRAN), medi- cine (MED), and world affairs (TIME) [2, 6, 7]. A typical recall-precision plot showing the effect of the right-to-left phrase transformation is shown in Figure 8 for the Medlars collection of 450 medical documents. Fig. 8. Average recall-precision comparison for phrases (Medlars: 450 documents, 24 queries). 1.0 ,9 .8 .7 .6 .5 .3 ,2 ,1 Precision o~ x o k o \ "°--.o - \\ "--..< Standard Phrase Term Assignment Prequency o R Preeision 0.1 .7891 0.2 .6750 0.3i .5481 0.4 .4807 0.5 .438~ 0.6 .3721 0.7 .3357 0,8 .2195 0,9 .1768 1.0 ,1230 Average Improvement .8811 .8149 .6992 .6481 .5930 .5450 .4867 .3263 .2767 .1969 + 39% t I i ' 2 Recall .1 ,2 .3 .q .5 ,6 .7 .8 ,9 l.O Table IV. Summary of Recall-Precision Evaluation (Three Collections) CRAN 424 MED 450 TIME 425 Automatic phrases vs. standard term frequency -k- 32% -t- 39% q- 17% Automatic phrases plus thesaurus vs. standard run +33% +50% +18% Best Precision low recall 0.89 0.88 0.85 medium recall 0.43 0.61 0.70 high recall 0.13 0.23 0.45 When recall is plotted against precision, the curve closest to the upper right-hand corner of the graph (where both recall and precision are close to 1) re- flects the best performance. It may be seen from Figure 8 that the replacement of the high frequency nondis- criminators by lower frequency phrases improves the retrieval performance by an average of 39 percent (the precision values at the ten fixed recall points are greater by an average of 39 percent). The performance of the right-to-left (phrase) trans- formation and left

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket