`
`Carolyn J. Crouch
`Department of Computer Science
`Tulane University
`New Orleans, LA 70118
`
The importance of a thesaurus in the successful operation of an information retrieval system is well recognized. Yet techniques which support the automatic generation of thesauri remain largely undiscovered. This paper describes one approach to the automatic generation of global thesauri, based on the discrimination value model of Salton, Yang, and Yu and on an appropriate clustering algorithm. This method has been implemented and applied to two document collections. Preliminary results indicate that this method, which produces improvements in retrieval performance in excess of 10 and 15 percent in the test collections, is viable and worthy of continued investigation.
`
`INTRODUCTION
`
A major factor in the successful operation of any information retrieval system is the set of dictionaries available for use by the system. Of these dictionaries (e.g., thesauri, statistical and/or syntactic phrase dictionaries, term hierarchies, etc.), the dictionary having the greatest potential impact on system performance is undoubtedly the thesaurus. Although the benefits accruing from the use of a well constructed thesaurus in terms of increased system performance are well recognized, the methodology for automatically creating such a thesaurus remains unspecified. In fact, virtually all thesauri presently in use are idiosyncratic.
Thus a topic of considerable interest to researchers aiming to improve the overall performance of information retrieval systems is automatic thesaurus construction. This paper describes an approach to the automatic construction of a global thesaurus based on the discrimination value model of Salton, Yang, and Yu [SALTON75a] and on an appropriate clustering algorithm. The discrimination value model itself is based on the vector space model described below.
`
`
`
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1988 ACM 0-89791-274-8/88/0600/0309 $1.50
`
`
`
`
`
`
`
`—309—
`
`(cid:20)
`1
`
`PETITIONERS- EXHIBIT 1022
`PETITIONERS- EXHIBIT 1022
`IPR2022-00217
`
`IPR2022-00217
`
`
`
The Vector Space Model
`
One of the major models in information retrieval is the vector space model. This model views each document in the document collection as a set of unique words or word types. Each document may then be regarded as a term vector, and the complete document collection becomes a vector space of dimension m, where m is the number of word types in the collection. In the vector space model, a document vector, dj, is represented by a set of terms, djk, 1 ≤ k ≤ m, where djk represents the frequency (or weight) of term k in document j (i.e., the number of times term k appears in document j). If djk = 0, term k does not appear in document dj. Queries, like documents, are represented by weighted term vectors.
Given any two term vectors, the similarity between the vectors may be assumed to be inversely related to the angle between them. If the two vectors coincide, the angle between them is zero, and the vectors are identical. In two dimensions, the vector space may be represented by its envelope. The (normalized) vectors are then viewed as points in the vector space, and the distance between any two points is inversely related to the similarity of the corresponding document vectors. The smaller the distance between two points, the smaller the angle between the corresponding vectors, and the greater the similarity of the vectors in terms of the number of word types they have in common.
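To make the representation concrete, the sketch below builds simple term-frequency vectors over the word types of a toy collection; the collection, the helper names, and the use of Python are illustrative assumptions, not part of the original experiment.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Map a document onto an m-dimensional term-frequency vector,
    where m is the number of word types in the collection."""
    counts = Counter(text.lower().split())
    return [counts.get(term, 0) for term in vocabulary]

# toy collection (illustrative only)
docs = ["automatic thesaurus construction",
        "automatic construction of global thesauri"]
vocabulary = sorted({w for d in docs for w in d.lower().split()})
# vectors[j][k] corresponds to djk, the frequency of term k in document j
vectors = [term_vector(d, vocabulary) for d in docs]
```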
Salton et al. [SALTON75a, SALTON75b, SALTON76] have shown that the best document space for retrieval purposes is one which maximizes the average separation between documents in the document space. In this space, it is easier to distinguish between documents and thus easier to retrieve documents which are most similar to a given query. The model which allows the terms in a collection to be ranked in order of their effect on space density is called the discrimination value model.
`
`The Discrimination Value Model
`
The discrimination value model [SALTON75a] assigns specific roles to single terms, term phrases, and term classes for content analysis purposes and provides a framework within which each potential index term in a collection can be ranked in accordance with its usefulness as a document discriminator. It also offers a reasonable physical interpretation of the indexing process.
If we consider a collection of documents, each represented by a set of weighted m-dimensional vectors, then the similarity coefficient computed between any two term vectors can be interpreted as a measure of the closeness or relatedness between the vectors in m-space. If the similarity coefficient is large, the documents are very similar and appear in close proximity to each other in the document space. And if the similarity coefficient is small, the documents exhibit little similarity and are widely separated in the document space.
`
`-310
`
`(cid:21)
`
`
`
`The discrimination value of a term is then defined as a measure of the change
`in space separation which occurs when a given term is assigned to the document
`collection. A good discriminator is a term which, when assigned to a document,
decreases the space density (i.e., renders the documents less similar to each
`other). Conversely, the assignment of a poor discriminator increases the space
`density. By computing the density of the document space before and after the
`assignment of each term, the discrimination value of the term can be determined.
`The terms can then be ranked in decreasing order of their discrimination values.
Salton, Yang, and Yu [SALTON75a] have used discrimination value to determine three categories of discriminators, namely, good, poor, and indifferent discriminators. A term with a positive discrimination value has been found to be a good discriminator. Salton et al. suggest that these terms be used directly as index terms. Those terms with negative discrimination values are poor discriminators; the retrieval properties of such terms can be transformed by including them in appropriate phrases. The majority of terms are indifferent discriminators with near-zero discrimination values. The retrieval capabilities of these terms can be enhanced by means of their incorporation in appropriate thesaurus classes.
Thus the discrimination value model presents a criterion for the formation of global thesauri. According to this model, a thesaurus is composed of a set of thesaurus classes. A thesaurus class is composed of a group of terms, or word types. The terms within a class should be indifferent discriminators (i.e., those with near-zero discrimination values). Thus in order to use the criterion suggested by the discrimination model, the discrimination value of each term in the collection must be calculated and the terms ranked as good, indifferent, and poor discriminators according to their discrimination values.
But the calculation of discrimination value is normally expensive. Two different approaches have been used. One approach, the so-called exact method, involves the calculation of all pairwise similarities between the document vectors of the collection. For a collection of n documents and m word types, the complexity of this algorithm is O(mn²). The second or approximate approach to the calculation of discrimination value involves the construction of an artificial, average document, the centroid, and computes the sum of the similarities of each document with the centroid. The centroid algorithm is O(mn).
Modifications have been suggested which improve the execution times associated with both the exact and the approximate methods of calculating discrimination value [CRAWFORD75, WILLET85, CROUCH88]. Although the discrimination values produced by these two approaches differ significantly for a particular collection, it has been shown that the rankings of the terms are in fact highly compatible [CROUCH88]. Thus of these two methods, the more efficient centroid approach is the obvious method of choice when discrimination values must be calculated.
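As a sketch of the centroid-based computation under the definitions above, the code below measures space density as the summed document-centroid similarity and takes a term's discrimination value to be the change in density when the term is removed. This is an illustrative, naive reading; the modifications cited above avoid recomputing the density from scratch for every term.

```python
import numpy as np

def density(doc_term):
    """Space density: sum of cosine similarities between each
    document vector and the collection centroid."""
    centroid = doc_term.mean(axis=0)
    norms = np.linalg.norm(doc_term, axis=1) * np.linalg.norm(centroid)
    norms[norms == 0] = 1.0  # guard against empty vectors
    return float((doc_term @ centroid / norms).sum())

def discrimination_values(doc_term):
    """doc_term: n x m matrix of term weights. DV(k) is the density of
    the space without term k minus the density with it; positive
    values mark good discriminators."""
    base = density(doc_term)
    return np.array([density(np.delete(doc_term, k, axis=1)) - base
                     for k in range(doc_term.shape[1])])
```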
`
`—311—
`
`(cid:22)
`
`
`
But for a document collection of any size, even the centroid approach to the calculation of discrimination value may be expensive. A reasonable alternative has been provided by Salton, Yang, and Yu, who suggest the use of document frequency as an approximation to discrimination value. For any term k in a document collection, the document frequency of term k, dk, is defined as the number of documents in which term k appears. Empirical results indicate that document frequency and discrimination value are strongly correlated. Let n represent the number of documents in a collection whose terms are ranked by increasing document frequency. According to [SALTON75a], those terms whose document frequency is less than n/100 may be considered low frequency terms. The discrimination values of these terms are normally near-zero, and these terms as a whole may be considered indifferent discriminators. Likewise, terms whose document frequencies are greater than n/10 may be considered high frequency terms. These terms normally have negative discrimination values and are considered poor discriminators. The remaining terms (i.e., n/100 ≤ dk ≤ n/10) make up the set of good discriminators. The discrimination values of these terms are positive.
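A minimal sketch of this frequency-based classification, using the n/100 and n/10 cutoffs suggested in [SALTON75a] (the function and variable names are illustrative):

```python
def classify_by_document_frequency(doc_freq, n):
    """doc_freq maps each term to the number of documents containing
    it; n is the collection size. Returns the three frequency bands
    approximating indifferent, good, and poor discriminators."""
    low, good, high = [], [], []
    for term, df in doc_freq.items():
        if df < n / 100:
            low.append(term)    # near-zero DV: thesaurus-class candidates
        elif df > n / 10:
            high.append(term)   # negative DV: phrase candidates
        else:
            good.append(term)   # positive DV: direct index terms
    return low, good, high
```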
Thus document frequency may be used as an approximation to discrimination value. Thesaurus classes, which theoretically should consist of groupings of terms with near-zero discrimination values, may instead be constructed of sets of low frequency terms. Since document frequency is readily available for every term in a collection, the cost associated with this approach is minimal.
`
`AN APPROACH TO THESAURUS CONSTRUCTION
`
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on low frequency terms. The term "global thesaurus" is used to differentiate this type of thesaurus from the "local thesaurus" described by Attar and Fraenkel [ATTAR77]. In a global approach, thesaurus classes, once constructed, are used to index both documents and queries. The local thesaurus, in contrast, uses information obtained from the documents retrieved in response to a particular query to modify that query, which is then resubmitted to the retrieval system for processing in lieu of the original. Thus a global thesaurus is constructed prior to the indexing process and the thesaurus classes are used to index both documents and queries, whereas a local thesaurus is constructed dynamically during query processing and uses information retrieved in response to a specific query to modify only that query.
`
`Constructing Thesaurus Classes
`
Constructing a global thesaurus based on the discrimination value model calls for the generation of thesaurus classes consisting of indifferent discriminators or (as a viable alternative) low frequency terms. The question of how the classes themselves are to be constructed remains open. Intuitively, a thesaurus class should consist of terms which are closely related in the context of the current collection. (Such terms are not necessarily synonyms in the conventional sense.) One approach to generating groups of closely related terms is to cluster all the terms in the collection. The low frequency terms which cluster together might then be considered a thesaurus class. Unfortunately, in an environment where there is little information to exploit, the resultant clusters are seldom meaningful. This is the case in the low frequency domain, where terms are contained in only a small number of the documents in the collection.
`
An alternative approach is to cluster the documents of the collection and to generate thesaurus classes from the low frequency terms contained in the document clusters. A key question then arises, namely, what type of clustering algorithm should be used? The choice is dictated largely by the type of clusters an algorithm produces. In order to generate meaningful thesaurus classes, the low frequency terms in a class should come from closely related documents. This implies that the document clusters themselves should be small and tight. An algorithm which produces clusters of this type is the complete-link clustering algorithm, one of a class of agglomerative, hierarchical clustering algorithms that have received some attention in the literature [VANRIJSB79, VOORHEES85, VOORHEES86]. Consequently, this was the algorithm used to cluster the documents in our test collections.
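As an illustration of this step, complete-link clustering of document vectors can be sketched with standard tooling. The use of scipy is an assumption for exposition (the original work predates it), and converting cosine similarity to distance as 1 - similarity is one common choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def complete_link_clusters(doc_vectors, threshold):
    """Cluster documents with the complete-link algorithm and cut the
    hierarchy so that every pair of documents inside a cluster has
    cosine similarity >= threshold."""
    distances = pdist(np.asarray(doc_vectors, dtype=float), metric="cosine")
    tree = linkage(distances, method="complete")
    # complete-link merge heights are maximum pairwise distances, so a
    # similarity threshold t corresponds to a distance cut of 1 - t
    return fcluster(tree, t=1.0 - threshold, criterion="distance")
```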
`
`Constructing a Global Thesaurus
`
The following procedure has been utilized to construct a global thesaurus:
1. The document collection is clustered via the complete-link algorithm.
2. The resultant hierarchy is traversed and thesaurus classes are generated, based on specified, user-supplied parameters.
3. The documents and queries are indexed by the thesaurus classes.
The characteristics of the thesaurus classes generated in step 2 are determined by the following user-supplied parameters:
`
(a) THRESHOLD VALUE
`
Application of the complete-link clustering algorithm produces a hierarchy in which the tightest clusters (i.e., those which cluster at the highest threshold values) lie at the bottom of the cluster tree. These nodes are the leaves of the tree. For example, consider Fig. 1. The squares represent documents and the numbers in the circles represent the levels at which the documents cluster. Documents A and B cluster at a threshold value of 0.089, D and E cluster at a level of 0.149, and document C clusters with the D-E subtree at a level of 0.077. The A-B subtree and the C-D-E subtree cluster at a threshold value of 0.029.

Fig. 1 A sample hierarchy generated by the complete-link algorithm [figure not reproduced]

The user-supplied threshold value largely determines the documents from which terms are selected for inclusion in a thesaurus class. In Fig. 1, a threshold value of 0.090 would return only the D-E document cluster, since the level at which the documents cluster must be greater than or equal to the specified threshold in order for the cluster to be recognized. A threshold value of 0.085, on the other hand, would return two clusters (A-B and D-E). If a threshold value of 0.075 were specified, two clusters would be returned: A-B and either D-E or C-D-E.
`
(b) NUMBER OF DOCUMENTS IN A CLUSTER
`
Since both D-E and C-D-E meet the threshold value of 0.075 in Fig. 1, another parameter is needed to determine the number of documents to be included in the final cluster. This parameter, supplied by the user, is the number of documents in a cluster. In the example, a value of 2 for this parameter would ensure that only the D-E cluster would be returned. A value of 3 or greater would allow the entire C-D-E cluster to be returned.
`
(c) MINIMUM DOCUMENT FREQUENCY
`
Once the document clusters have been selected, thesaurus classes are formed from the low frequency terms contained in those clusters. Before such terms can be selected, the minimum document frequency of each term, i.e., the criterion by which terms are determined to be "low frequency," must be specified. The guidelines in [SALTON75a] may be helpful in setting this parameter.
`
(d) SPECIFICATION OF CLASS FORMATION METHOD
`
When the document clusters have been determined (via the settings of the first two parameters) and the low frequency terms from the clustered documents have been gathered, the thesaurus classes themselves can be generated. The simplest approach is to define a thesaurus class as the intersection of all the low frequency terms in a cluster. Alternate methods, using (1) union and (2) restricted intersection (ranking the terms in the intersection and using the top ranked terms from that set) in lieu of the intersection, were also implemented.
The most successful approach turned out to be the most straightforward: the best results were obtained by using simple intersection. The experimental results described in this paper are based on this approach.
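The intersection-based class formation can be sketched as follows; here low_freq_terms is assumed to map each document to its set of low frequency terms, and the names are illustrative:

```python
def thesaurus_class(cluster_doc_ids, low_freq_terms):
    """Form a thesaurus class as the intersection of the low frequency
    terms of the documents in a single cluster."""
    term_sets = [low_freq_terms[d] for d in cluster_doc_ids]
    return set.intersection(*term_sets) if term_sets else set()

# union, the alternative that was also implemented, would instead be
# set.union(*term_sets); restricted intersection would rank the terms
# in the intersection and keep only the top ranked ones.
```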
`
`--314—
`
`(cid:25)
`
`
`
Once the thesaurus classes are generated, a collection can be indexed by replacement or augmentation (see step 3 above). Replacement entails replacing a term in a vector by a thesaurus class which contains that term. Augmentation involves adding the thesaurus class to the vector; the original term remains in the vector, although its weight may be modified. (For a discussion of term weighting methods, see [SALTON87].) Both methods were implemented in this experiment. The results obtained by augmentation are in every case vastly superior to those obtained through replacement, and thus this discussion focuses on indexing by augmentation.
Another factor of some interest during the indexing process is the weighting of the new thesaurus classes. The classes themselves must be weighted so as to ensure that they neither override the original vector terms nor have a negligible effect during retrieval. An intuitively appealing approach is to compute the thesaurus class weight as a function of the terms contained in that class. As a result, the thesaurus class weights in this experiment were computed according to the following formula, where tc_wt denotes the thesaurus class weight, tc_con_wti denotes the weight of term i in the thesaurus class, |tc_con| represents the number of terms in the thesaurus class, and avg_tc_con_wt represents the average weight of a term in the thesaurus class. Then
`
$$avg\_tc\_con\_wt = \frac{\sum_{i=1}^{|tc\_con|} tc\_con\_wt_i}{|tc\_con|}$$

and

$$tc\_wt = \frac{avg\_tc\_con\_wt}{|tc\_con|} \times 0.5$$
`Thus a thesaurus class weight is computed by dividing the average weight of a term
`in this class by the number of terms in the class and downweighting that value by
`0.5. The experimental results indicate that this approach is indeed effective.
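Read literally, the weighting and augmentation steps might look like the sketch below. The dictionary vector representation and helper names are assumptions, and taking the member-term weights from the vector being augmented is one possible reading of tc_con_wti.

```python
def thesaurus_class_weight(member_weights):
    """tc_wt = (average member-term weight / class size) * 0.5."""
    size = len(member_weights)                 # |tc_con|
    avg = sum(member_weights) / size           # avg_tc_con_wt
    return avg / size * 0.5

def augment(vector, thesaurus_classes):
    """Index by augmentation: add each thesaurus class that shares a
    term with the vector as a new entry weighted by tc_wt; the
    original terms stay in the vector."""
    for class_id, members in thesaurus_classes.items():
        weights = [vector[t] for t in members if t in vector]
        if weights:
            vector[class_id] = thesaurus_class_weight(weights)
    return vector
```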
`
THE EXPERIMENT

Methodology
`
Global thesauri were constructed for two test collections using the procedure described in the previous section. To recap: the collections were clustered using the complete-link algorithm. The hierarchies were traversed and thesaurus classes were generated based on user-supplied parameters (threshold value, number of documents in a cluster, minimum document frequency, and specification of class formation method). The documents and queries were then indexed by the thesaurus classes. In all cases, the indexing was performed by augmentation, and the thesaurus class weights were generated based on the weighting scheme detailed in the previous section. The weights of the original terms were not modified.
`
`Setting the Parameters
`
With reference to the parameters used in step 2 of the thesaurus construction procedure:
(a) Threshold value is collection-dependent. After a collection is clustered, appropriate threshold values may be determined by examining the cluster tree. The setting of the threshold value is the most vital factor in forming the thesaurus classes. This setting in conjunction with the class formation method determines the size of the thesaurus class. If the setting is too high, thesaurus classes will be generated with either too many terms (using union for class formation) or with too few terms (using intersection). If the threshold is too low, very few classes will be generated.
(b) The number of documents in a cluster obviously also affects thesaurus class size, but the overall effect of this parameter is considerably less than that of the threshold value. An initial premise of this experiment is that terms in a thesaurus class should come from closely related documents, which in turn implies that the document clusters themselves must be small and tight. For investigative purposes this parameter was allowed to range from 2 to 5 in our test runs, but it is always subject to (i.e., governed by) the threshold value and in most cases, the actual number of documents in a cluster ranges from 2 to 3.
(c) Minimum document frequency can be set by using the guidelines suggested in [SALTON75a].
(d) Class formation method is the parameter which specifies exactly how the thesaurus class itself is formed once the document cluster has been determined. Of the three approaches investigated (union, intersection, and restricted intersection), simple intersection gives the best results in all cases. Thus a thesaurus composed of thesaurus classes consisting of sets of low frequency terms common to a small group of closely related documents appears to have the greatest potential for effecting an improvement in system performance.
`
`Collection Weighting
`
Once the global thesaurus has been constructed and the collection (documents and queries) indexed by the thesaurus classes, the vectors are then weighted appropriately, using facilities made available through the Smart experimental retrieval system [BUCKLEY85]. (The original vectors, before modification, are weighted by simple term frequency.) Smart offers a number of options with respect to weighting. The most common approach is probably to weight the vectors using "tf-idf" weights (i.e., term frequency multiplied by inverse document frequency). This scheme is appealing because it allows a term to be weighted by a factor dependent upon both its importance within a document (its
`--316—
`
`(cid:27)
`
`
`
term frequency) and its importance within the document collection as a whole (its document frequency). Moreover, tf-idf weighting has proved most effective in comparison with other weighting schemes in numerous experiments.
The collection weighting method used in this experiment is a variation of the tf-idf scheme, wherein the term frequency component is referred to as "augmented normalized term frequency." In this method, the term frequency is normalized and augmented and is thus guaranteed to lie between 0.5 and 1.0. This component then replaces the tf component in the tf-idf scheme to generate the weight for this term in the modified document collection.
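A sketch of this weighting, assuming the usual form of augmented normalized term frequency, 0.5 + 0.5·(tf/max_tf), and a standard log-based idf; both exact forms are assumptions here:

```python
import math

def augmented_normalized_tf(tf, max_tf):
    """Normalize and augment tf so it lies between 0.5 and 1.0;
    max_tf is the largest term frequency in the document."""
    return 0.5 + 0.5 * tf / max_tf

def weight(tf, max_tf, df, n_docs):
    """Variant tf-idf: augmented normalized tf times inverse
    document frequency."""
    return augmented_normalized_tf(tf, max_tf) * math.log(n_docs / df)
```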
`
`Similarity Measure
`
A common measure of the similarity between two vectors dj and dk is the cosine function

$$\cos(d_j, d_k) = \frac{d_j \cdot d_k}{\|d_j\| \, \|d_k\|}$$

where $d_j \cdot d_k$ is the inner product and $\|d_j\|^2 = d_j \cdot d_j$. This is the measure used to compute similarity in this experiment.
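In code, the measure is a short function (a sketch using numpy):

```python
import numpy as np

def cosine(dj, dk):
    """Cosine similarity between two term vectors."""
    denom = np.linalg.norm(dj) * np.linalg.norm(dk)
    return float(np.dot(dj, dk) / denom) if denom else 0.0
```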
`
`Collection Characteristics
`
The thesaurus construction procedure described in this paper was run on two standard test collections, namely, ADI and Medlars. The characteristics of these collections are found in Table I. (For more complete descriptions of these collections, see [FOX83a, FOX83b, VOORHEES85].)
`
Table I
Collection Statistics

Collection    Number of Vectors    Number of Terms    Mean Number of Terms per Vector
ADI           82                   822                25.5
Medlars       1033                 6927               51.6
`
`—317—
`
`(cid:28)
`
`
`
`Empirical Results
`
The results of the experiment, as applied to the ADI and Medlars collections, are displayed in Table II. The normal evaluation measures, precision and recall, are used. (Recall is defined as the proportion of relevant material retrieved, while precision is the proportion of retrieved material that is relevant.) Part (a) gives the statistics in terms of precision and recall when the thesaurus generation procedure is applied to the ADI collection (using in this case the following parameters: a threshold level of 0.075, 5 documents per cluster, a minimum document frequency of 20, and intersection as the class formation method). The statistics which apply to the original, unmodified ADI collection are found in column 1. Column 2 pertains to the modified ADI collection after application of the global thesaurus. The average precision at 3 points of recall (0.25, 0.50, and 0.75) is calculated for each collection and shows an overall improvement of 10.6 per cent when the global thesaurus is applied.
Part (b) gives a similar view when the procedure is applied to the Medlars collection (with corresponding parameter settings of 0.120, 3, 50, and intersection). Again, column 1 applies to the original Medlars collection and column 2 applies to the same collection after application of the global thesaurus. The improvement in precision (based on the 3 point average) of the modified collection over the original is 15.8 per cent.
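For reference, the two evaluation measures follow directly from their set-based definitions; a minimal sketch (function names are illustrative):

```python
def recall(retrieved, relevant):
    """Proportion of relevant material that was retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision(retrieved, relevant):
    """Proportion of retrieved material that is relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)
```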
`
`SUMMARY
`
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on the discrimination value model and an appropriate clustering algorithm. A thesaurus construction procedure was designed, implemented, and applied to two standard test collections. The preliminary results show that substantial improvements can be effected by making a properly constructed global thesaurus available to a retrieval system during processing. Moreover, as the empirical results indicate, it is possible to construct such a thesaurus automatically. The question of whether it is possible to construct such a thesaurus for other collections remains to be answered.
`
`—318—
`
`(cid:20)(cid:19)
`10
`
`
`
TABLE II
A Comparison of Retrieval Statistics: The Original versus the Thesaurus-Based Collection

[The detailed recall-precision tables and summary statistics of Table II (normalized recall and precision, rank recall, log precision, precision and recall after 10 and 30 documents, E-measures, and number of queries) are too badly garbled in this copy to reproduce. Panel (a) ADI (threshold .075): the average precision over the three intermediate recall points rises from 0.4353 for the original collection, a precision change of 10.6 per cent. Panel (b) Medlars (threshold .120): the corresponding average precision rises from 0.6045, an improvement of 15.8 per cent.]

(a) ADI          (b) Medlars
`
`—319—
`
`(cid:20)(cid:20)
`11
`
`
`
`REFERENCES
`
[ATTAR77] Attar, R., and A.S. Fraenkel, Local Feedback in Full-text Retrieval Systems, Journal of the ACM, 24(3):397-417, 1977.

[BUCKLEY85] Buckley, C., Implementation of the Smart Information Retrieval System, Tech. Report 85-685, Dept. of Computer Science, Cornell University, May 1985.

[CRAWFORD75] Crawford, R., The Computation of Discrimination Values, Information Processing and Management, 11:249-253, 1975.

[CROUCH88] Crouch, C., An Analysis of Approximate Versus Exact Discrimination Values, Information Processing and Management, 24(1):5-16, 1988.

[FOX83a] Fox, E., Extending the Boolean and Vector Space Models of Information Retrieval with P-norm Queries and Multiple Concept Types, Ph.D. Thesis, Cornell University, Aug. 1983.

[FOX83b] Fox, E., Characteristics of Two New Experimental Collections in Computer and Information Science Containing Textual and Bibliographic Concepts, Tech. Report 83-561, Dept. of Computer Science, Cornell University, Sept. 1983.

[SALTON75a] Salton, G., C.S. Yang, and C.T. Yu, A Theory of Term Importance in Automatic Text Analysis, Journal of the ASIS, 26(1):33-44, 1975.

[SALTON75b] Salton, G., A. Wong, and C.S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18:613-620, 1975.

[SALTON76] Salton, G., and A. Wong, On the Role of Words and Phrases in Automatic Text Analysis, Computers and the Humanities, 10:69-87, 1976.

[SALTON87] Salton, G., and C. Buckley, Term Weighting Approaches in Automatic Text Retrieval, Tech. Report 87-881, Dept. of Computer Science, Cornell University, Nov. 1987.

[VANRIJSB79] van Rijsbergen, C.J., Information Retrieval, 2nd edition, Butterworths, London, England, 1979.

[VOORHEES85] Voorhees, E., The Effectiveness and Efficiency of Agglomerative Hierarchic Clustering in Document Retrieval, Tech. Report 85-705, Dept. of Computer Science, Cornell University, Oct. 1985.

[VOORHEES86] Voorhees, E., Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval, Tech. Report 86-765, Dept. of Computer Science, Cornell University, July 1986.

[WILLET85] Willet, P., An Algorithm for the Calculation of Exact Term Discrimination Values, Information Processing and Management, 21(3):225-232, 1985.
`
`—320-—
`
`(cid:20)(cid:21)
`12
`
`