`
`Carolyn J. Crouch
`Department of Computer Science
`Tulane University
`New Orleans, LA 70118
`
The importance of a thesaurus in the successful operation of an information retrieval system is well recognized. Yet techniques which support the automatic generation of thesauri remain largely undiscovered. This paper describes one approach to the automatic generation of global thesauri, based on the discrimination value model of Salton, Yang, and Yu and on an appropriate clustering algorithm. This method has been implemented and applied to two document collections. Preliminary results indicate that this method, which produces improvements in retrieval performance in excess of 10 and 15 percent in the test collections, is viable and worthy of continued investigation.
`
`INTRODUCTION
`
A major factor in the successful operation of any information retrieval system is the set of dictionaries available for use by the system. Of these dictionaries (e.g., thesauri, statistical and/or syntactic phrase dictionaries, term hierarchies, etc.), the dictionary having the greatest potential impact on system performance is undoubtedly the thesaurus. Although the benefits accruing from the use of a well constructed thesaurus in terms of increased system performance are well recognized, the methodology for automatically creating such a thesaurus remains unspecified. In fact, virtually all thesauri presently in use are idiosyncratic.

Thus a topic of considerable interest to researchers aiming to improve the overall performance of information retrieval systems is automatic thesaurus construction. This paper describes an approach to the automatic construction of a global thesaurus based on the discrimination value model of Salton, Yang, and Yu [SALTON75a] and on an appropriate clustering algorithm. The discrimination value model itself is based on the vector space model described below.
`
`
`
`The Vector Space Model
`
One of the major models in information retrieval is the vector space model. This model views each document in the document collection as a set of unique words or word types. Each document may then be regarded as a term vector, and the complete document collection becomes a vector space of dimension m, where m is the number of word types in the collection. In the vector space model, a document vector, dj, is represented by a set of terms, djk, 1 <= k <= m, where djk represents the frequency (or weight) of term k in document j (i.e., the number of times term k appears in document j). If djk = 0, term k does not appear in document dj. Queries, like documents, are represented by weighted term vectors.

Given any two term vectors, the similarity between the vectors may be assumed to be inversely related to the angle between them. If the two vectors coincide, the angle between them is zero, and the vectors are identical. In two dimensions, the vector space may be represented by its envelope. The (normalized) vectors are then viewed as points in the vector space, and the distance between any two points is inversely related to the similarity of the corresponding document vectors. The smaller the distance between two points, the smaller the angle between the corresponding vectors, and the greater the similarity of the vectors in terms of the number of word types they have in common.

Salton et al. [SALTON75a, SALTON75b, SALTON76] have shown that the best document space for retrieval purposes is one which maximizes the average separation between documents in the document space. In this space, it is easier to distinguish between documents and thus easier to retrieve documents which are most similar to a given query. The model which allows the terms in a collection to be ranked in order of their effect on space density is called the discrimination value model.
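To make the representation concrete, the following sketch builds term-frequency vectors for a small collection and measures the angle-based similarity between two of them. It is illustrative only; the whitespace tokenization, the function names, and the use of raw term frequencies as weights are assumptions, not details taken from the paper.

    from collections import Counter
    import math

    def build_vectors(documents):
        # One dimension per word type; d_jk = frequency of term k in document j.
        vocabulary = sorted({term for doc in documents for term in doc.split()})
        index = {term: i for i, term in enumerate(vocabulary)}
        vectors = []
        for doc in documents:
            vec = [0.0] * len(vocabulary)
            for term, freq in Counter(doc.split()).items():
                vec[index[term]] = float(freq)
            vectors.append(vec)
        return vectors, vocabulary

    def cosine(u, v):
        # Cosine of the angle between two term vectors: 1.0 when they coincide.
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

Two documents sharing many word types yield a cosine near 1.0; documents with no terms in common yield 0.0.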
`
`The Discrimination Value Model
`
The discrimination value model [SALTON75a] assigns specific roles to single terms, term phrases, and term classes for content analysis purposes and provides a framework within which each potential index term in a collection can be ranked in accordance with its usefulness as a document discriminator. It also offers a reasonable physical interpretation of the indexing process.

If we consider a collection of documents, each represented by a set of weighted m-dimensional vectors, then the similarity coefficient computed between any two term vectors can be interpreted as a measure of the closeness or relatedness between the vectors in m-space. If the similarity coefficient is large, the documents are very similar and appear in close proximity to each other in the document space. And if the similarity coefficient is small, the documents exhibit little similarity and are widely separated in the document space.
`
`—310.-
`
`2
`
`
`
The discrimination value of a term is then defined as a measure of the change in space separation which occurs when a given term is assigned to the document collection. A good discriminator is a term which, when assigned to a document, decreases the space density (i.e., renders the documents less similar to each other). Conversely, the assignment of a poor discriminator increases the space density. By computing the density of the document space before and after the assignment of each term, the discrimination value of the term can be determined. The terms can then be ranked in decreasing order of their discrimination values.

Salton, Yang, and Yu [SALTON75a] have used discrimination value to determine three categories of discriminators, namely, good, poor, and indifferent discriminators. A term with a positive discrimination value has been found to be a good discriminator. Salton et al. suggest that these terms be used directly as index terms. Those terms with negative discrimination values are poor discriminators; the retrieval properties of such terms can be transformed by including them in appropriate phrases. The majority of terms are indifferent discriminators with near-zero discrimination values. The retrieval capabilities of these terms can be enhanced by means of their incorporation in appropriate thesaurus classes.

Thus the discrimination value model presents a criterion for the formation of global thesauri. According to this model, a thesaurus is composed of a set of thesaurus classes. A thesaurus class is composed of a group of terms or word types. The terms within a class should be indifferent discriminators (i.e., those with near-zero discrimination values). Thus in order to use the criterion suggested by the discrimination model, the discrimination value of each term in the collection must be calculated and the terms ranked as good, indifferent, and poor discriminators according to their discrimination values.
But the calculation of discrimination value is normally expensive. Two different approaches have been used. One approach, the so-called exact method, involves the calculation of all pairwise similarities between the document vectors of the collection. For a collection of n documents and m word types, the complexity of this algorithm is O(mn^2). The second or approximate approach to the calculation of discrimination value involves the construction of an artificial, average document, the centroid, and computes the sum of the similarities of each document with the centroid. The centroid algorithm is O(mn).

Modifications have been suggested which improve the execution times associated with both the exact and the approximate methods of calculating discrimination value [CRAWFORD75, WILLET85, CROUCH88]. Although the discrimination values produced by these two approaches differ significantly for a particular collection, it has been shown that the rankings of the terms are in fact highly compatible [CROUCH88]. Thus of these two methods, the more efficient centroid approach is the obvious method of choice when discrimination values must be calculated.
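As a sketch of the approximate (centroid) method, assuming cosine as the similarity function and a naive recomputation of the reduced space for each term (all names are illustrative):

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def space_density(vectors):
        # Density = average similarity between each document and the centroid.
        n = len(vectors)
        m = len(vectors[0])
        centroid = [sum(vec[k] for vec in vectors) / n for k in range(m)]
        return sum(cosine(vec, centroid) for vec in vectors) / n

    def discrimination_value(vectors, k):
        # DV_k = density of the space without term k minus density with it.
        # Positive DV: assigning term k spreads the space (a good discriminator);
        # negative DV: assigning term k compresses it (a poor discriminator).
        without_k = [vec[:k] + vec[k + 1:] for vec in vectors]
        return space_density(without_k) - space_density(vectors)

A practical implementation would update the centroid and the vector norms incrementally rather than rebuilding the reduced space for every term; savings of that kind are what the modifications cited above pursue.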
`
`—311-
`
`3
`
`
`
But for a document collection of any size, even the centroid approach to the calculation of discrimination value may be expensive. A reasonable alternative has been provided by Salton, Yang, and Yu, who suggest the use of document frequency as an approximation to discrimination value. For any term k in a document collection, the document frequency of term k, dk, is defined as the number of documents in which term k appears. Empirical results indicate that document frequency and discrimination value are strongly correlated. Let n represent the number of documents in a collection whose terms are ranked by increasing document frequency. According to [SALTON75a], those terms whose document frequency is less than n/100 may be considered low frequency terms. The discrimination values of these terms are normally near-zero, and these terms as a whole may be considered indifferent discriminators. Likewise, terms whose document frequencies are greater than n/10 may be considered high frequency terms. These terms normally have negative discrimination values and are considered poor discriminators. The remaining terms (i.e., n/100 <= dk <= n/10) make up the set of good discriminators. The discrimination values of these terms are positive.

Thus document frequency may be used as an approximation to discrimination value. Thesaurus classes, which theoretically should consist of groupings of terms with near-zero discrimination values, may instead be constructed of sets of low frequency terms. Since document frequency is readily available for every term in a collection, the cost associated with this approach is minimal.
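A direct transcription of these guidelines might look as follows (a sketch; the handling of terms falling exactly on the n/100 and n/10 boundaries is an assumption):

    def categorize_by_document_frequency(vectors, n):
        # df = number of documents in which term k appears.
        m = len(vectors[0])
        low, good, high = [], [], []
        for k in range(m):
            df = sum(1 for vec in vectors if vec[k] > 0)
            if df < n / 100:
                low.append(k)    # indifferent discriminators: thesaurus class candidates
            elif df > n / 10:
                high.append(k)   # poor discriminators: phrase candidates
            else:
                good.append(k)   # good discriminators: usable directly as index terms
        return low, good, high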
`
`AN APPROACH TO THESAURUS CONSTRUCTION
`
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on low frequency terms. The term "global thesaurus" is used to differentiate this type of thesaurus from the "local thesaurus" described by Attar and Fraenkel [ATTAR77]. In a global approach, thesaurus classes, once constructed, are used to index both documents and queries. The local thesaurus, in contrast, uses information obtained from the documents retrieved in response to a particular query to modify that query, which is then resubmitted to the retrieval system for processing in lieu of the original. Thus a global thesaurus is constructed prior to the indexing process and the thesaurus classes are used to index both documents and queries, whereas a local thesaurus is constructed dynamically during query processing and uses information retrieved in response to a specific query to modify only that query.
`
`Constructing Thesaurus Classes
`
Constructing a global thesaurus based on the discrimination value model calls for the generation of thesaurus classes consisting of indifferent discriminators or (as a viable alternative) low frequency terms. The question of how the classes themselves are to be constructed remains open. Intuitively, a thesaurus class should consist of terms which are closely related in the context of the current collection. (Such terms are not necessarily synonyms in the conventional sense.) One approach to generating groups of closely related terms is to cluster all the terms in the collection. The low frequency terms which cluster together might then be considered a thesaurus class. Unfortunately, in an environment where there is little information to exploit, the resultant clusters are seldom meaningful. This is the case in the low frequency domain, where terms are contained in only a small number of the documents in the collection.
`
An alternative approach is to cluster the documents of the collection and to generate thesaurus classes from the low frequency terms contained in the document clusters. A key question then arises, namely, what type of clustering algorithm should be used? The choice is dictated largely by the type of clusters an algorithm produces. In order to generate meaningful thesaurus classes, the low frequency terms in a class should come from closely related documents. This implies that the document clusters themselves should be small and tight. An algorithm which produces clusters of this type is the complete-link clustering algorithm, one of a class of agglomerative, hierarchical clustering algorithms that has received some attention in the literature [VANRIJSB79, VOORHEES85, VOORHEES86]. Consequently, this was the algorithm used to cluster the documents in our test collections.
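The experiment used its own implementation; purely as an illustrative sketch, SciPy's agglomerative routines can produce the same kind of complete-link hierarchy. Cutting the tree at cosine distance 1 - threshold yields clusters in which every pair of documents clusters at or above the given similarity threshold:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    def complete_link_clusters(vectors, threshold):
        # Complete-link merges clusters by their *farthest* pair, so the
        # resulting clusters are small and tight, as required here.
        distances = pdist(np.asarray(vectors, dtype=float), metric="cosine")
        tree = linkage(distances, method="complete")
        labels = fcluster(tree, t=1.0 - threshold, criterion="distance")
        clusters = {}
        for doc_id, label in enumerate(labels):
            clusters.setdefault(label, []).append(doc_id)
        return list(clusters.values())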
`
`Constructing a Global Thesaurus
`
The following procedure has been utilized to construct a global thesaurus:
1. The document collection is clustered via the complete-link algorithm.
2. The resultant hierarchy is traversed and thesaurus classes are generated, based on specified, user-supplied parameters.
3. The documents and queries are indexed by the thesaurus classes.
The characteristics of the thesaurus classes generated in step 2 are determined by the following user-supplied parameters:
`
(a) THRESHOLD VALUE
`
Application of the complete-link clustering algorithm produces a hierarchy in which the tightest clusters (i.e., those which cluster at the highest threshold values) lie at the bottom of the cluster tree. These nodes are the leaves of the tree. For example, consider Fig. 1. The squares represent documents and the numbers in the circles represent the levels at which the documents cluster. Documents A and B cluster at a threshold value of 0.089, D and E cluster at a level of 0.149, and document C clusters with the D-E subtree at a level of 0.077. The A-B subtree and the C-D-E subtree cluster at a threshold value of 0.029.

[Fig. 1. A sample hierarchy generated by the complete-link algorithm.]

The user-supplied threshold value largely determines the documents from which terms are selected for inclusion in a thesaurus class. In Fig. 1, a threshold value of 0.090 would return only the D-E document cluster, since the level at which the documents cluster must be greater than or equal to the specified threshold in order for the cluster to be recognized. A threshold value of 0.085, on the other hand, would return two clusters (A-B and D-E). If a threshold value of 0.075 were specified, two clusters would be returned: A-B and either D-E or C-D-E.
`
(b) NUMBER OF DOCUMENTS IN A CLUSTER
`
Since both D-E and C-D-E meet the threshold value of 0.075 in Fig. 1, another parameter is needed to determine the number of documents to be included in the final cluster. This parameter, supplied by the user, is the number of documents in a cluster. In the example, a value of 2 for this parameter would ensure that only the D-E cluster would be returned. A value of 3 or greater would allow the entire C-D-E cluster to be returned.
`
(c) MINIMUM DOCUMENT FREQUENCY
`
Once the document clusters have been selected, thesaurus classes are formed from the low frequency terms contained in those clusters. Before such terms can be selected, the minimum document frequency of each term, i.e., the criterion by which terms are determined to be "low frequency," must be specified. The guidelines in [SALTON75a] may be helpful in setting this parameter.
`
(d) SPECIFICATION OF CLASS FORMATION METHOD
`
When the document clusters have been determined (via the settings of the first two parameters) and the low frequency terms from the clustered documents have been gathered, the thesaurus classes themselves can be generated. The simplest approach is to define a thesaurus class as the intersection of all the low frequency terms in a cluster. Alternate methods, using (1) union and (2) restricted intersection (ranking the terms in the intersection and using the top ranked terms from that set) in lieu of the intersection, were also implemented.

The most successful approach turned out to be the most straightforward. The best results were obtained by using simple intersection, and the experimental results described in this paper are based on this approach. A sketch of the three class formation methods follows.
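In the sketch below, the ranking criterion used for restricted intersection is not specified in the paper; total term weight within the cluster is assumed here purely for illustration, and all names are hypothetical:

    def form_thesaurus_class(cluster, vectors, low_freq_terms,
                             method="intersection", top=5):
        # cluster: list of document ids; low_freq_terms: indices of low
        # frequency terms. Returns the set of term indices forming the class.
        term_sets = [{k for k in low_freq_terms if vectors[d][k] > 0}
                     for d in cluster]
        if method == "union":
            return set().union(*term_sets)
        common = set.intersection(*term_sets)   # simple intersection (best results)
        if method == "restricted_intersection":
            ranked = sorted(common,
                            key=lambda k: sum(vectors[d][k] for d in cluster),
                            reverse=True)
            return set(ranked[:top])
        return common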
`
`—314—
`
`6
`
`
`
Once the thesaurus classes are generated, a collection can be indexed by replacement or augmentation (see step 3 above). Replacement entails replacing a term in a vector by a thesaurus class which contains that term. Augmentation involves adding the thesaurus class to the vector; the original term remains in the vector, although its weight may be modified. (For a discussion of term weighting methods, see [SALTON87].) Both methods were implemented in this experiment. The results obtained by augmentation are in every case vastly superior to those obtained through replacement, and thus this discussion focuses on indexing by augmentation.
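A minimal sketch of indexing by augmentation under these rules (names illustrative; each thesaurus class containing a term of the vector simply contributes one new component):

    def index_by_augmentation(vector, thesaurus_classes, class_weights):
        # Append one component per thesaurus class that contains a term of
        # this vector; the original term components are left in place.
        augmented = list(vector)
        for class_id, class_terms in enumerate(thesaurus_classes):
            if any(vector[k] > 0 for k in class_terms):
                augmented.append(class_weights[class_id])
        return augmented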
Another factor of some interest during the indexing process is the weighting of the new thesaurus classes. The classes themselves must be weighted so as to ensure that they neither override the original vector terms nor have a negligible effect during retrieval. An intuitively appealing approach is to compute the thesaurus class weight as a function of the terms contained in that class. As a result, the thesaurus class weights in this experiment were computed according to the following formula, where tc_wt denotes the thesaurus class weight, tc_con_wt_i denotes the weight of term i in this thesaurus class, |tc_con| represents the number of terms in the thesaurus class, and avg_tc_con_wt represents the average weight of a term in the thesaurus class. Then
`
avg\_tc\_con\_wt = \frac{1}{|tc\_con|} \sum_{i=1}^{|tc\_con|} tc\_con\_wt_i

and

tc\_wt = \frac{avg\_tc\_con\_wt}{|tc\_con|} \cdot 0.5
`
`Thus a thesaurus class weight is computed by dividing the average weight of a term
`in this class by the number of terms in the class and downweighting that value by
`0.5. The experimental results indicate that this approach is indeed effective.
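In code, the weighting rule is a direct transcription of the formula (a sketch; term_weights holds the weights tc_con_wt_i of the terms in the class):

    def thesaurus_class_weight(term_weights):
        # tc_wt = (average term weight in the class / class size) * 0.5
        size = len(term_weights)          # |tc_con|
        avg = sum(term_weights) / size    # avg_tc_con_wt
        return (avg / size) * 0.5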
`
THE EXPERIMENT

Methodology
`
Global thesauri were constructed for two test collections using the procedure described in the previous section. To recap: the collections were clustered using the complete-link algorithm. The hierarchies were traversed and thesaurus classes were generated based on user-supplied parameters (threshold value, number of documents in a cluster, minimum document frequency, and specification of class formation method). The documents and queries were then indexed by the thesaurus classes. In all cases, the indexing was performed by augmentation, and the thesaurus class weights were generated based on the weighting scheme detailed in the previous section. The weights of the original terms were not modified.
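Composing the sketches given earlier, steps 1 and 2 of the procedure might read as follows; the parameter names mirror those listed above, and this remains an illustration rather than a transcription of the actual Smart-based implementation:

    def build_global_thesaurus(vectors, threshold, max_docs_per_cluster,
                               low_freq_terms):
        # Step 1: cluster the collection (complete_link_clusters, sketched earlier).
        # Step 2: form one class per small, tight cluster by simple intersection
        # (form_thesaurus_class, sketched earlier).
        classes = []
        for cluster in complete_link_clusters(vectors, threshold):
            if 2 <= len(cluster) <= max_docs_per_cluster:
                tc = form_thesaurus_class(cluster, vectors, low_freq_terms)
                if tc:
                    classes.append(tc)
        return classes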
`
`Setting the Parameters
`
With reference to the parameters used in step 2 of the thesaurus construction procedure:

(a) Threshold value is collection-dependent. After a collection is clustered, appropriate threshold values may be determined by examining the cluster tree. The setting of the threshold value is the most vital factor in forming the thesaurus classes. This setting, in conjunction with the class formation method, determines the size of the thesaurus class. If the setting is too high, thesaurus classes will be generated with either too many terms (using union for class formation) or too few terms (using intersection). If the threshold is too low, very few classes will be generated.
(b) The number of documents in a cluster obviously also affects thesaurus class size, but the overall effect of this parameter is considerably less than that of the threshold value. An initial premise of this experiment is that terms in a thesaurus class should come from closely related documents, which in turn implies that the document clusters themselves must be small and tight. For investigative purposes this parameter was allowed to range from 2 to 5 in our test runs, but it is always subject to (i.e., governed by) the threshold value, and in most cases the actual number of documents in a cluster ranges from 2 to 3.
(c) Minimum document frequency can be set by using the guidelines suggested in [SALTON75a].
(d) Class formation method is the parameter which specifies exactly how the thesaurus class itself is formed once the document cluster has been determined. Of the three approaches investigated (union, intersection, and restricted intersection), simple intersection gives the best results in all cases. Thus a thesaurus composed of thesaurus classes consisting of sets of low frequency terms common to a small group of closely related documents appears to have the greatest potential for effecting an improvement in system performance.
`
`Collection Weighting
`
Once the global thesaurus has been constructed and the collection (documents and queries) indexed by the thesaurus classes, the vectors are then weighted appropriately, using facilities made available through the Smart experimental retrieval system [BUCKLEY85]. (The original vectors, before modification, are weighted by simple term frequency.) Smart offers a number of options with respect to weighting. The most common approach is probably to weight the vectors using "tf-idf" weights (i.e., term frequency multiplied by inverse document frequency). This scheme is appealing because it allows a term to be weighted by a factor dependent upon both its importance within a document (its term frequency) and its importance within the document collection as a whole (its document frequency). Moreover, tf-idf weighting has proved most effective in comparison with other weighting schemes in numerous experiments.

The collection weighting method used in this experiment is a variation of the tf-idf scheme, wherein the term frequency component is referred to as "augmented normalized term frequency." In this method, the term frequency is normalized and augmented and is thus guaranteed to lie between 0.5 and 1.0. This component then replaces the tf component in the tf-idf scheme to generate the weight for this term in the modified document collection.
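A common form of this variant, and the one assumed in the sketch below, normalizes term frequency by the maximum frequency in the vector, so that 0.5 + 0.5 * tf / max_tf lies between 0.5 and 1.0, and multiplies by a logarithmic idf factor. The paper does not spell out these details, so treat them as assumptions:

    import math

    def augmented_normalized_tf_idf(vector, document_frequency, n):
        # weight_k = (0.5 + 0.5 * tf_k / max_tf) * log(n / df_k)
        max_tf = max(vector) or 1.0
        weights = []
        for k, tf in enumerate(vector):
            if tf == 0:
                weights.append(0.0)
                continue
            atf = 0.5 + 0.5 * (tf / max_tf)            # lies in [0.5, 1.0]
            idf = math.log(n / document_frequency[k])  # inverse document frequency
            weights.append(atf * idf)
        return weights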
`
`Similarity Measure
`
A common measure of the similarity between two vectors dj and dk is the cosine function

\cos(d_j, d_k) = \frac{d_j \cdot d_k}{\|d_j\| \, \|d_k\|}

where d_j \cdot d_k is the inner product and \|d_j\|^2 = (d_j \cdot d_j). This is the measure used to compute similarity in this experiment.
`
`Collection Characteristics
`
The thesaurus construction procedure described in this paper was run on two standard test collections, namely, ADI and Medlars. The characteristics of these collections are found in Table I. (For more complete descriptions of these collections, see [FOX83a, FOX83b, VOORHEES85].)
`
Table I
Collection Statistics

Collection    Number of    Number of    Mean Number of
              Vectors      Terms        Terms per Vector

ADI               82          822            25.5
Medlars         1033         6927            51.6
`
`—317—
`
`9
`
`
`
`Empirical Results
`
The results of the experiment, as applied to the ADI and Medlars collections, are displayed in Table II. The normal evaluation measures, precision and recall, are used. (Recall is defined as the proportion of relevant material retrieved, while precision is the proportion of retrieved material that is relevant.) Part (a) gives the statistics in terms of precision and recall when the thesaurus generation procedure is applied to the ADI collection (using in this case the following parameters: a threshold level of 0.075, 5 documents per cluster, a minimum document frequency of 20, and intersection as the class formation method). The statistics which apply to the original, unmodified ADI collection are found in column 1. Column 2 pertains to the modified ADI collection after application of the global thesaurus. The average precision at 3 points of recall (0.25, 0.50, and 0.75) is calculated for each collection and shows an overall improvement of 10.6 per cent when the global thesaurus is applied.

Part (b) gives a similar view when the procedure is applied to the Medlars collection (with corresponding parameter settings of 0.120, 3, 50, and intersection). Again, column 1 applies to the original Medlars collection and column 2 applies to the same collection after application of the global thesaurus. The improvement in precision (based on the 3 point average) of the modified collection over the original is 15.8 per cent.
`
`SUMMARY
`
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on the discrimination value model and an appropriate clustering algorithm. A thesaurus construction procedure was designed, implemented, and applied to two standard test collections. The preliminary results show that substantial improvements can be effected by making a properly constructed global thesaurus available to a retrieval system during processing. Moreover, as the empirical results indicate, it is possible to construct such a thesaurus automatically. The question of whether it is possible to construct such a thesaurus for other collections remains to be answered.
`
`—318—
`
`10
`10
`
`
`
TABLE II
A Comparison of Retrieval Statistics:
The Original versus the Thesaurus-Based Collection

[Part (a): the ADI collection (threshold 0.075, 5 documents per cluster). Part (b): the Medlars collection (threshold 0.120, 3 documents per cluster). Each part lists precision at recall levels 0.00 through 1.00, the average precision over 3 intermediate recall points with its percentage change, and summary statistics (normalized recall, normalized precision, rank recall, log precision, precision and recall after 10 and 30 documents, and E-measures at 0.5, 1.0, and 2.0) for the original collection (column 1) and the thesaurus-modified collection (column 2), over 35 ADI queries and 30 Medlars queries.]
`
`
`
`REFERENCES
`
[ATTAR77] Attar, R., and A. S. Fraenkel, Local Feedback in Full-text Retrieval Systems, Journal of the ACM, 24(3):397-417, 1977.

[BUCKLEY85] Buckley, C., Implementation of the Smart Information Retrieval System, Tech. Report 85-686, Dept. of Computer Science, Cornell University, May 1985.

[CRAWFORD75] Crawford, R., The Computation of Discrimination Values, Information Processing and Management, 11:249-253, 1975.

[CROUCH88] Crouch, C., An Analysis of Approximate Versus Exact Discrimination Values, Information Processing and Management, 24(1):5-16, 1988.

[FOX83a] Fox, E., Extending the Boolean and Vector Space Models of Information Retrieval with P-norm Queries and Multiple Concept Types, Ph.D. Thesis, Cornell University, Aug. 1983.

[FOX83b] Fox, E., Characteristics of Two New Experimental Collections in Computer and Information Science Containing Textual and Bibliographic Concepts, Tech. Report 83-561, Dept. of Computer Science, Cornell University, Sept. 1983.

[SALTON75a] Salton, G., C. S. Yang, and C. T. Yu, A Theory of Term Importance in Automatic Text Analysis, Journal of the ASIS, 26(1):33-44, 1975.

[SALTON75b] Salton, G., A. Wong, and C. S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18:613-620, 1975.

[SALTON76] Salton, G., and A. Wong, On the Role of Words and Phrases in Automatic Text Analysis, Computers and the Humanities, 10:69-87, 1976.

[SALTON87] Salton, G., and C. Buckley, Term Weighting Approaches in Automatic Text Retrieval, Tech. Report 87-881, Dept. of Computer Science, Cornell University, Nov. 1987.

[VANRIJSB79] van Rijsbergen, C. J., Information Retrieval, 2nd edition, Butterworths, London, England, 1979.

[VOORHEES85] Voorhees, E., The Effectiveness and Efficiency of Agglomerative Hierarchic Clustering in Document Retrieval, Tech. Report 85-705, Dept. of Computer Science, Cornell University, Oct. 1985.

[VOORHEES86] Voorhees, E., Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval, Tech. Report 86-765, Dept. of Computer Science, Cornell University, July 1986.

[WILLET85] Willet, P., An Algorithm for the Calculation of Exact Term Discrimination Values, Information Processing and Management, 21(3):225-232, 1985.
`