`
`Carolyn J. Crouch
`Department of Computer Science
`Tulane University
`New Orleans, LA 70118
`
`The importance of a thesaurus in the successful operation of an information
`retrieval system is well recognized. Yet techniques which support the automatic
generation of thesauri remain largely undiscovered. This paper describes one
`approach to the automatic generation of global
`thesauri, based on the
`discrimination value model of Salton, Yang, and Yu and on an appropriate
`clustering algorithm. This method has been implemented and applied to two
`document collections. Preliminary results indicate that this method, which produces
`improvements in retrieval performance in excess of 10 and 15 percent in the test
`collections, is viable and worthy of continued investigation.
`
`INTRODUCTION
`
A major factor in the successful operation of any information retrieval system is the set of dictionaries available for use by the system. Of these dictionaries (e.g., thesauri, statistical and/or syntactic phrase dictionaries, term hierarchies, etc.), the dictionary having the greatest potential impact on system performance is undoubtedly the thesaurus. Although the benefits accruing from the use of a well-constructed thesaurus in terms of increased system performance are well recognized, the methodology for automatically creating such a thesaurus remains unspecified. In fact, virtually all thesauri presently in use are idiosyncratic.
`
Thus a topic of considerable interest to researchers aiming to improve the overall performance of information retrieval systems is automatic thesaurus construction. This paper describes an approach to the automatic construction of a global thesaurus based on the discrimination value model of Salton, Yang, and Yu [SALTON75a] and on an appropriate clustering algorithm. The discrimination value model itself is based on the vector space model described below.
`
`
`
`
`
`
`
`
`
`
`
`
`
`The Vector Space Model
`
One of the major models in information retrieval is the vector space model. This model views each document in the document collection as a set of unique words or word types. Each document may then be regarded as a term vector, and the complete document collection becomes a vector space of dimension m, where m is the number of word types in the collection. In the vector space model, a document vector, dj, is represented by a set of terms, djk, 1 ≤ k ≤ m, where djk represents the frequency (or weight) of term k in document j (i.e., the number of times term k appears in document j). If djk = 0, term k does not appear in document dj. Queries, like documents, are represented by weighted term vectors.
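To make this representation concrete, the following Python sketch (illustrative only; the experiment described later used the Smart system, and all names here are hypothetical) builds simple term-frequency vectors over a toy collection:

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the m unique word types of the collection."""
    vocab = sorted({term for doc in documents for term in doc.split()})
    return {term: k for k, term in enumerate(vocab)}

def term_vector(doc, vocab):
    """Represent document j as an m-dimensional vector whose k-th entry
    d_jk is the frequency of term k in the document."""
    counts = Counter(doc.split())
    vec = [0] * len(vocab)
    for term, freq in counts.items():
        vec[vocab[term]] = freq
    return vec

docs = ["retrieval of information", "information retrieval systems"]
vocab = build_vocabulary(docs)
vectors = [term_vector(d, vocab) for d in docs]
```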
`
Given any two term vectors, the similarity between the vectors may be assumed to be inversely related to the angle between them. If the two vectors coincide, the angle between them is zero, and the vectors are identical. In two dimensions, the vector space may be represented by its envelope. The (normalized) vectors are then viewed as points in the vector space, and the distance between any two points is inversely related to the similarity of the corresponding document vectors. The smaller the distance between two points, the smaller the angle between the corresponding vectors, and the greater the similarity of the vectors in terms of the number of word types they have in common.
`
Salton et al. [SALTON75a, SALTON75b, SALTON76] have shown that the best document space for retrieval purposes is one which maximizes the average separation between documents in the document space. In this space, it is easier to distinguish between documents and thus easier to retrieve documents which are most similar to a given query. The model which allows the terms in a collection to be ranked in order of their effect on space density is called the discrimination value model.
`
`The Discrimination Value Model
`
The discrimination value model [SALTON75a] assigns specific roles to single terms, term phrases, and term classes for content analysis purposes and provides a framework within which each potential index term in a collection can be ranked in accordance with its usefulness as a document discriminator. It also offers a reasonable physical interpretation of the indexing process.
`
If we consider a collection of documents, each represented by a set of weighted m-dimensional vectors, then the similarity coefficient computed between any two term vectors can be interpreted as a measure of the closeness or relatedness between the vectors in m-space. If the similarity coefficient is large, the documents are very similar and appear in close proximity to each other in the document space. And if the similarity coefficient is small, the documents exhibit little similarity and are widely separated in the document space.
`
`
`
`
The discrimination value of a term is then defined as a measure of the change in space separation which occurs when a given term is assigned to the document collection. A good discriminator is a term which, when assigned to a document, decreases the space density (i.e., renders the documents less similar to each other). Conversely, the assignment of a poor discriminator increases the space density. By computing the density of the document space before and after the assignment of each term, the discrimination value of the term can be determined. The terms can then be ranked in decreasing order of their discrimination values.
`
Salton, Yang, and Yu [SALTON75a] have used discrimination value to determine three categories of discriminators, namely, good, poor, and indifferent discriminators. A term with a positive discrimination value has been found to be a good discriminator. Salton et al. suggest that these terms be used directly as index terms. Those terms with negative discrimination values are poor discriminators; the retrieval properties of such terms can be transformed by including them in appropriate phrases. The majority of terms are indifferent discriminators with near-zero discrimination values. The retrieval capabilities of these terms can be enhanced by means of their incorporation in appropriate thesaurus classes.
`
Thus the discrimination value model presents a criterion for the formation of global thesauri. According to this model, a thesaurus is composed of a set of thesaurus classes. A thesaurus class is composed of a group of terms or word types. The terms within a class should be indifferent discriminators (i.e., those with near-zero discrimination values). Thus in order to use the criterion suggested by the discrimination model, the discrimination value of each term in the collection must be calculated and the terms ranked as good, indifferent, and poor discriminators according to their discrimination values.
`
But the calculation of discrimination value is normally expensive. Two different approaches have been used. One approach, the so-called exact method, involves the calculation of all pairwise similarities between the document vectors of the collection. For a collection of n documents and m word types, the complexity of this algorithm is O(mn²). The second or approximate approach to the calculation of discrimination value involves the construction of an artificial, average document, the centroid, and computes the sum of the similarities of each document with the centroid. The centroid algorithm is O(mn).
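A minimal sketch of the centroid (approximate) method follows, assuming cosine similarity and simple term-frequency vectors; all names are illustrative. Note that this naive version recomputes the density from scratch for each term, so it runs in O(m²n); the O(mn) bound quoted above relies on incremental centroid updates not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def density(vectors, skip=None):
    """Space density: the sum of similarities of each document to the
    centroid, optionally ignoring one term (dimension) entirely."""
    m = len(vectors[0])
    dims = [k for k in range(m) if k != skip]
    centroid = [sum(v[k] for v in vectors) / len(vectors) for k in dims]
    return sum(cosine([v[k] for k in dims], centroid) for v in vectors)

def discrimination_values(vectors):
    """DV_k = density without term k minus density with term k;
    a positive value marks a good discriminator."""
    q = density(vectors)
    return [density(vectors, skip=k) - q for k in range(len(vectors[0]))]
```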
`
Modifications have been suggested which improve the execution times associated with both the exact and the approximate methods of calculating discrimination value [CRAWFORD75, WILLETT85, CROUCH88]. Although the discrimination values produced by these two approaches differ significantly for a particular collection, it has been shown that the rankings of the terms are in fact highly compatible [CROUCH88]. Thus of these two methods, the more efficient centroid approach is the obvious method of choice when discrimination values must be calculated.
`
`
`
`
But for a document collection of any size, even the centroid approach to the calculation of discrimination value may be expensive. A reasonable alternative has been provided by Salton, Yang, and Yu, who suggest the use of document frequency as an approximation to discrimination value. For any term k in a document collection, the document frequency of term k, dk, is defined as the number of documents in which term k appears. Empirical results indicate that document frequency and discrimination value are strongly correlated.
`
Let n represent the number of documents in a collection whose terms are ranked by increasing document frequency. According to [SALTON75a], those terms whose document frequency is less than n/100 may be considered low frequency terms. The discrimination values of these terms are normally near-zero, and these terms as a whole may be considered indifferent discriminators. Likewise, terms whose document frequencies are greater than n/10 may be considered high frequency terms. These terms normally have negative discrimination values and are considered poor discriminators. The remaining terms (i.e., n/100 ≤ dk ≤ n/10) make up the set of good discriminators. The discrimination values of these terms are positive.
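These cutoffs translate directly into code. The sketch below (hypothetical names; thresholds as given in [SALTON75a]) partitions a document-frequency table accordingly:

```python
def partition_by_document_frequency(doc_freq, n):
    """Split terms into low (df < n/100), high (df > n/10), and
    intermediate ("good discriminator") frequency classes."""
    low, good, high = [], [], []
    for term, df in doc_freq.items():
        if df < n / 100:
            low.append(term)
        elif df > n / 10:
            high.append(term)
        else:
            good.append(term)
    return low, good, high
```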
`
Thus document frequency may be used as an approximation to discrimination value. Thesaurus classes, which theoretically should consist of groupings of terms with near-zero discrimination values, may instead be constructed of sets of low frequency terms. Since document frequency is readily available for every term in a collection, the cost associated with this approach is minimal.
`
`AN APPROACH TO THESAURUS CONSTRUCTION
`
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on low frequency terms. The term "global thesaurus" is used to differentiate this type of thesaurus from the "local thesaurus" described by Attar and Fraenkel [ATTAR77]. In a global approach, thesaurus classes, once constructed, are used to index both documents and queries. The local thesaurus, in contrast, uses information obtained from the documents retrieved in response to a particular query to modify that query, which is then resubmitted to the retrieval system for processing in lieu of the original. Thus a global thesaurus is constructed prior to the indexing process and the thesaurus classes are used to index both documents and queries, whereas a local thesaurus is constructed dynamically during query processing and uses information retrieved in response to a specific query to modify only that query.
`
`Constructing Thesaurus Classes
`
Constructing a global thesaurus based on the discrimination value model calls for the generation of thesaurus classes consisting of indifferent discriminators or (as a viable alternative) low frequency terms. The question of how the classes themselves are to be constructed remains open. Intuitively, a thesaurus class should consist of terms which are closely related in the context of the current collection. (Such terms are not necessarily synonyms in the conventional sense.)

One approach to generating groups of closely related terms is to cluster all the terms in the collection. The low frequency terms which cluster together might then be considered a thesaurus class. Unfortunately, in an environment where there is little information to exploit, the resultant clusters are seldom meaningful. This is the case in the low frequency domain, where terms are contained in only a small number of the documents in the collection.
`
An alternative approach is to cluster the documents of the collection and to generate thesaurus classes from the low frequency terms contained in the document clusters. A key question then arises, namely, what type of clustering algorithm should be used? The choice is dictated largely by the type of clusters an algorithm produces. In order to generate meaningful thesaurus classes, the low frequency terms in a class should come from closely related documents. This implies that the document clusters themselves should be small and tight. An algorithm which produces clusters of this type is the complete-link clustering algorithm, one of a class of agglomerative, hierarchical clustering algorithms that has received some attention in the literature [VANRIJSB79, VOORHEES85, VOORHEES86]. Consequently, this was the algorithm used to cluster the documents in our test collections.
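By way of illustration (the paper's own clustering implementation is not shown; this sketch leans on SciPy, which provides a complete-link method, and treats the user-supplied similarity threshold as one minus cosine distance):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy document-term matrix: rows are documents, columns are word types.
doc_term = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 1, 2, 1],
], dtype=float)

# Agglomerative complete-link clustering on pairwise cosine distances.
distances = pdist(doc_term, metric="cosine")
hierarchy = linkage(distances, method="complete")

# Cut the hierarchy at a similarity threshold of 0.075,
# i.e., at a cosine distance of 1 - 0.075.
labels = fcluster(hierarchy, t=1 - 0.075, criterion="distance")
```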
`
`Constructing a Global Thesaurus
`
The following procedure has been utilized to construct a global thesaurus:

1. The document collection is clustered via the complete-link algorithm.

2. The resultant hierarchy is traversed and thesaurus classes are generated, based on specified, user-supplied parameters.

3. The documents and queries are indexed by the thesaurus classes.
`
The characteristics of the thesaurus classes generated in step 2 are determined by the following user-supplied parameters:

(a) THRESHOLD VALUE
`
Application of the complete-link clustering algorithm produces a hierarchy in which the tightest clusters (i.e., those which cluster at the highest threshold values) lie at the bottom of the cluster tree. These nodes are the leaves of the tree. For example, consider Fig. 1. The squares represent documents and the numbers in the circles represent the levels at which the documents cluster. Documents A and B cluster at a threshold value of 0.089, D and E cluster at a level of 0.149, and document C clusters with the D-E subtree at a level of 0.077. The A-B subtree and the C-D-E subtree cluster at a threshold value of 0.029.

[Fig. 1: A sample hierarchy generated by the complete-link algorithm.]

The user-supplied threshold value largely determines the documents from which terms are selected for inclusion in a thesaurus class. In Fig. 1, a threshold value of 0.090 would return only the D-E document cluster, since the level at which the documents cluster must be greater than or equal to the specified threshold in order for the cluster to be recognized. A threshold value of 0.085, on the other hand, would return two clusters (A-B and D-E). If a threshold value of 0.075 were specified, two clusters would be returned: A-B and either D-E or C-D-E.
`
(b) NUMBER OF DOCUMENTS IN A CLUSTER

Since both D-E and C-D-E meet the threshold value of 0.075 in Fig. 1, another parameter is needed to determine the number of documents to be included in the final cluster. This parameter, supplied by the user, is the number of documents in a cluster. In the example, a value of 2 for this parameter would ensure that only the D-E cluster would be returned. A value of 3 or greater would allow the entire C-D-E cluster to be returned. (The interaction of the threshold and cluster-size parameters is illustrated in the first sketch following this list.)
`
(c) MINIMUM DOCUMENT FREQUENCY

Once the document clusters have been selected, thesaurus classes are formed from the low frequency terms contained in those clusters. Before such terms can be selected, the minimum document frequency of each term, i.e., the criterion by which terms are determined to be "low frequency," must be specified. The guidelines in [SALTON75a] may be helpful in setting this parameter.
`
(d) SPECIFICATION OF CLASS FORMATION METHOD

When the document clusters have been determined (via the settings of the first two parameters) and the low frequency terms from the clustered documents have been gathered, the thesaurus classes themselves can be generated. The simplest approach is to define a thesaurus class as the intersection of all the low frequency terms in a cluster. Alternate methods, using (1) union and (2) restricted intersection (ranking the terms in the intersection and using the top ranked terms from that set) in lieu of the intersection, were also implemented.

The most successful approach turned out to be the most straightforward. The best results were obtained by using simple intersection. The experimental results described in this paper are based on this approach.
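The interaction of the threshold and cluster-size parameters, (a) and (b), can be made concrete with a small sketch. Here the hierarchy is assumed to be encoded as nested tuples of the form (level, left, right), with document identifiers at the leaves; all names are hypothetical:

```python
def leaves(node):
    """Collect the documents under a node of the cluster tree."""
    if isinstance(node, tuple):
        _, left, right = node
        return leaves(left) + leaves(right)
    return [node]  # a leaf is a single document identifier

def select_clusters(node, threshold, max_docs):
    """Return the document groups that cluster at a level >= threshold
    and contain no more than max_docs documents."""
    if not isinstance(node, tuple):
        return []
    level, left, right = node
    docs = leaves(node)
    if level >= threshold and len(docs) <= max_docs:
        return [docs]
    return (select_clusters(left, threshold, max_docs)
            + select_clusters(right, threshold, max_docs))

# Fig. 1 encoded as nested tuples.
tree = (0.029, (0.089, "A", "B"), (0.077, "C", (0.149, "D", "E")))
print(select_clusters(tree, threshold=0.075, max_docs=2))
# [['A', 'B'], ['D', 'E']]
```

Class formation by intersection, parameter (d), is equally direct. A sketch, assuming each clustered document is given as a set of its terms:

```python
def form_thesaurus_class(cluster_docs, low_frequency_terms):
    """Define a thesaurus class as the intersection of the low
    frequency terms appearing in every document of the cluster."""
    low = set(low_frequency_terms)
    per_doc = [set(doc) & low for doc in cluster_docs]
    return set.intersection(*per_doc) if per_doc else set()
```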
`
`
`
`
Once the thesaurus classes are generated, a collection can be indexed by replacement or augmentation (see step 3 above). Replacement entails replacing a term in a vector by a thesaurus class which contains that term. Augmentation involves adding the thesaurus class to the vector; the original term remains in the vector, although its weight may be modified. (For a discussion of term weighting methods, see [SALTON87].) Both methods were implemented in this experiment. The results obtained by augmentation are in every case vastly superior to those obtained through replacement, and thus this discussion focuses on indexing by augmentation.
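Indexing by augmentation might be sketched as follows (a simplification; the weighting of the added classes is taken up next, and the dictionary-based vectors and names are hypothetical):

```python
def augment_vector(doc_terms, thesaurus_classes, class_weights):
    """Keep the original terms of a document vector and add an entry
    for each thesaurus class containing at least one of its terms.
    doc_terms: {term: weight}; thesaurus_classes: {class_id: set of terms}."""
    vector = dict(doc_terms)  # the original terms are retained
    for class_id, members in thesaurus_classes.items():
        if members & doc_terms.keys():
            vector[("class", class_id)] = class_weights[class_id]
    return vector
```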
`
Another factor of some interest during the indexing process is the weighting of the new thesaurus classes. The classes themselves must be weighted so as to ensure that they neither override the original vector terms nor have a negligible effect during retrieval. An intuitively appealing approach is to compute the thesaurus class weight as a function of the terms contained in that class. As a result, the thesaurus class weights in this experiment were computed according to the following formula, where tc_wt denotes the thesaurus class weight, tc_con_wt_i denotes the weight of term i in this thesaurus class, |tc_con| represents the number of terms in the thesaurus class, and avg_tc_con_wt represents the average weight of a term in the thesaurus class. Then
`
$$\mathrm{avg\_tc\_con\_wt} = \frac{\sum_{i=1}^{|\mathrm{tc\_con}|} \mathrm{tc\_con\_wt}_i}{|\mathrm{tc\_con}|}$$

and

$$\mathrm{tc\_wt} = \frac{\mathrm{avg\_tc\_con\_wt}}{|\mathrm{tc\_con}|} \cdot 0.5$$
`
Thus a thesaurus class weight is computed by dividing the average weight of a term in this class by the number of terms in the class and downweighting that value by 0.5. The experimental results indicate that this approach is indeed effective.
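The formula transcribes directly into code; for example (illustrative names):

```python
def thesaurus_class_weight(member_weights):
    """tc_wt = (average member-term weight / number of members) * 0.5."""
    avg = sum(member_weights) / len(member_weights)
    return (avg / len(member_weights)) * 0.5

# A three-term class with term weights 0.9, 0.6, and 0.3 has an
# average weight of 0.6, so tc_wt = (0.6 / 3) * 0.5 = 0.1.
```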
`
`THE EXPERIMENT
`
`Methodology
`
Global thesauri were constructed for two test collections using the procedure described in the previous section. To recap: the collections were clustered using the complete-link algorithm. The hierarchies were traversed and thesaurus classes were generated based on user-supplied parameters (threshold value, number of documents in a cluster, minimum document frequency, and specification of class formation method). The documents and queries were then indexed by the thesaurus classes. In all cases, the indexing was performed by augmentation, and the thesaurus class weights were generated based on the weighting scheme detailed in the previous section. The weights of the original terms were not modified.
`
`Setting the Parameters
`
With reference to the parameters used in step 2 of the thesaurus construction procedure:

(a) Threshold value is collection-dependent. After a collection is clustered, appropriate threshold values may be determined by examining the cluster tree. The setting of the threshold value is the most vital factor in forming the thesaurus classes. This setting in conjunction with the class formation method determines the size of the thesaurus class. If the setting is too high, thesaurus classes will be generated with either too many terms (using union for class formation) or with too few terms (using intersection). If the threshold is too low, very few classes will be generated.
`
(b) The number of documents in a cluster obviously also affects thesaurus class size, but the overall effect of this parameter is considerably less than that of the threshold value. An initial premise of this experiment is that terms in a thesaurus class should come from closely related documents, which in turn implies that the document clusters themselves must be small and tight. For investigative purposes this parameter was allowed to range from 2 to 5 in our test runs, but it is always subject to (i.e., governed by) the threshold value, and in most cases the actual number of documents in a cluster ranges from 2 to 3.
`
(c) Minimum document frequency can be set by using the guidelines suggested in [SALTON75a].
`
(d) Class formation method is the parameter which specifies exactly how the thesaurus class itself is formed once the document cluster has been determined. Of the three approaches investigated (union, intersection, and restricted intersection), simple intersection gives the best results in all cases. Thus a thesaurus composed of thesaurus classes consisting of sets of low frequency terms common to a small group of closely related documents appears to have the greatest potential for effecting an improvement in system performance.
`
`Collection Weighting
`
Once the global thesaurus has been constructed and the collection (documents and queries) indexed by the thesaurus classes, the vectors are then weighted appropriately, using facilities made available through the Smart experimental retrieval system [BUCKLEY85]. (The original vectors, before modification, are weighted by simple term frequency.) Smart offers a number of options with respect to weighting. The most common approach is probably to weight the vectors using "tf-idf" weights (i.e., term frequency multiplied by inverse document frequency). This scheme is appealing because it allows a term to be weighted by a factor dependent upon both its importance within a document (its term frequency) and its importance within the document collection as a whole (its document frequency). Moreover, tf-idf weighting has proved most effective in comparison with other weighting schemes in numerous experiments.
`
The collection weighting method used in this experiment is a variation of the tf-idf scheme, wherein the term frequency component is referred to as "augmented normalized term frequency." In this method, the term frequency is normalized and augmented and is thus guaranteed to lie between 0.5 and 1.0. This component then replaces the tf component in the tf-idf scheme to generate the weight for this term in the modified document collection.
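A sketch of this weighting, assuming the common augmented normalized form 0.5 + 0.5·tf/max_tf (the paper does not spell out its exact normalization) combined with a logarithmic idf:

```python
import math

def augmented_normalized_tf(tf, max_tf):
    """Normalized and augmented term frequency; lies in [0.5, 1.0] for tf > 0."""
    return 0.5 + 0.5 * (tf / max_tf)

def tf_idf_weight(tf, max_tf, df, n_docs):
    """Variant tf-idf weight with the augmented normalized tf component."""
    return augmented_normalized_tf(tf, max_tf) * math.log(n_docs / df)
```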
`
`Similarity Measure
`
A common measure of the similarity between two vectors dj and dk is the cosine function

$$\cos(d_j, d_k) = \frac{d_j \cdot d_k}{\|d_j\| \, \|d_k\|}$$

where $d_j \cdot d_k$ is the inner product and $\|d_j\|^2 = (d_j \cdot d_j)$. This is the measure used to compute similarity in this experiment.
`
`Collection Characteristics
`
The thesaurus construction procedure described in this paper was run on two standard test collections, namely, ADI and Medlars. The characteristics of these collections are found in Table I. (For more complete descriptions of these collections, see [FOX83a, FOX83b, VOORHEES85].)
`
Table I
Collection Statistics

Collection    Number of Vectors    Number of Terms    Mean Number of Terms per Vector
ADI           82                   822                25.5
Medlars       1033                 6927               51.6
`
`
`
`
`Empirical Results
`
The results of the experiment, as applied to the ADI and Medlars collections, are displayed in Table II. The normal evaluation measures, precision and recall, are used. (Recall is defined as the proportion of relevant material retrieved, while precision is the proportion of retrieved material that is relevant.) Part (a) gives the statistics in terms of precision and recall when the thesaurus generation procedure is applied to the ADI collection (using in this case the following parameters: a threshold level of 0.075, 5 documents per cluster, a minimum document frequency of 20, and intersection as the class formation method). The statistics which apply to the original, unmodified ADI collection are found in column 1. Column 2 pertains to the modified ADI collection after application of the global thesaurus. The average precision at 3 points of recall (0.25, 0.50, and 0.75) is calculated for each collection and shows an overall improvement of 10.6 per cent when the global thesaurus is applied.
`
Part (b) gives a similar view when the procedure is applied to the Medlars collection (with corresponding parameter settings of 0.120, 3, 50, and intersection). Again, column 1 applies to the original Medlars collection and column 2 applies to the same collection after application of the global thesaurus. The improvement in precision (based on the 3 point average) of the modified collection over the original is 15.8 per cent.
`
`SUMMARY
`
An experiment was designed to investigate the feasibility of constructing a global thesaurus based on the discrimination value model and an appropriate clustering algorithm. A thesaurus construction procedure was designed, implemented, and applied to two standard test collections. The preliminary results show that substantial improvements can be effected by making a properly constructed global thesaurus available to a retrieval system during processing. Moreover, as the empirical results indicate, it is possible to construct such a thesaurus automatically. The question of whether it is possible to construct such a thesaurus for other collections remains to be answered.
`
`
`
`
TABLE II

A Comparison of Retrieval Statistics:
The Original versus the Thesaurus-Based Collection

[Table II gives the full recall-precision curves and summary statistics for each collection, with column 1 reporting the original collection and column 2 the collection after application of the global thesaurus. Part (a) covers the ADI run (threshold 0.075, 5 documents per cluster): average precision over the 3 intermediate recall points rises from 0.4353 to 0.4815, an improvement of 10.6 per cent. Part (b) covers the Medlars run (threshold 0.120, 3 documents per cluster): the corresponding improvement is 15.8 per cent.]
`
`
`
`
`REFERENCES
`
[ATTAR77] Attar, R., and A. S. Fraenkel, Local Feedback in Full-text Retrieval Systems, Journal of the ACM, 24(3):397-417, 1977.

[BUCKLEY85] Buckley, C., Implementation of the Smart Information Retrieval System, Tech. Report 85-686, Dept. of Computer Science, Cornell University, May 1985.

[CRAWFORD75] Crawford, R., The Computation of Discrimination Values, Information Processing and Management, 11:249-253, 1975.

[CROUCH88] Crouch, C., An Analysis of Approximate Versus Exact Discrimination Values, Information Processing and Management, 24(1):5-16, 1988.

[FOX83a] Fox, E., Extending the Boolean and Vector Space Models of Information Retrieval with P-norm Queries and Multiple Concept Types, Ph.D. Thesis, Cornell University, Aug. 1983.

[FOX83b] Fox, E., Characteristics of Two New Experimental Collections in Computer and Information Science Containing Textual and Bibliographic Concepts, Tech. Report 83-561, Dept. of Computer Science, Cornell University, Sept. 1983.

[SALTON75a] Salton, G., C. S. Yang, and C. T. Yu, A Theory of Term Importance in Automatic Text Analysis, Journal of the ASIS, 26(1):33-44, 1975.

[SALTON75b] Salton, G., A. Wong, and C. S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18(11):613-620, 1975.

[SALTON76] Salton, G., and A. Wong, On the Role of Words and Phrases in Automatic Text Analysis, Computers and the Humanities, 10:69-87, 1976.

[SALTON87] Salton, G., and C. Buckley, Term Weighting Approaches in Automatic Text Retrieval, Tech. Report 87-881, Dept. of Computer Science, Cornell University, Nov. 1987.

[VANRIJSB79] van Rijsbergen, C. J., Information Retrieval, 2nd edition, Butterworths, London, England, 1979.

[VOORHEES85] Voorhees, E., The Effectiveness and Efficiency of Agglomerative Hierarchic Clustering in Document Retrieval, Tech. Report 85-705, Dept. of Computer Science, Cornell University, Oct. 1985.

[VOORHEES86] Voorhees, E., Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval, Tech. Report 86-765, Dept. of Computer Science, Cornell University, July 1986.

[WILLETT85] Willett, P., An Algorithm for the Calculation of Exact Term Discrimination Values, Information Processing and Management, 21(3):225-232, 1985.
`
`