`
`Carolyn J. Crouch
`Department of Computer Science
`Tulane University
`New Orleans, LA 70118
`
`The importance of a thesaurus in the successful operation of an information
`retrieval system is well recognized. Yet techniques which support the automatic
`generation of thesauri remain largely undiscovered. This paper describes one
`approach to the automatic generation of global
`thesauri, based on the
`discrimination value model of Salton, Yang, and Yu and on an appropriate
`clustering algorithm. This method has been implemented and applied to two
`document collections. Preliminary results indicate that this method, which produces
`improvements in retrieval performance in excess of 10 and 15 percent in the test
`collections, is viable and worthy of continued investigation.
`
`INTRODUCTION
`
`A major factor in the successful operation of any information retrieval system is
`
the set of dictionaries available for use by the system. Of these dictionaries (e.g.,
thesauri, statistical and/or syntactic phrase dictionaries, term hierarchies, etc.), the
dictionary having the greatest potential impact on system performance is
undoubtedly the thesaurus. Although the benefits accruing from the use of a
well-constructed thesaurus in terms of increased system performance are well
recognized, the methodology for automatically creating such a thesaurus remains
unspecified.
`
`In fact, virtually all thesauri presently in use are idiosyncratic.
`
`Thus a topic of considerable interest to researchers aiming to improve the
`
`overall performance of information retrieval systems is automatic thesaurus
`
`construction. This paper describes an approach to the automatic construction of a
`
`global thesaurus based on the discrimination value model of Salton, Yang, and Yu
`
`[SALTON75a] and on an appropriate clustering algorithm. The discrimination value
`
`model itself is based on the vector space model described below.
`
`
`
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed
for direct commercial advantage, the ACM copyright notice and the title of the publication and its date
appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy
otherwise, or to republish, requires a fee and/or specific permission.
`
© 1988 ACM 0-89791-274-8 $1.50
`
`
`
`
`
`
`
`
`
`
`The Vector Space Model
`
`One of the major models in information retrieval is the vector space model.
`
`This model views each document in the document collection as a set of unique
`
words or word types. Each document may then be regarded as a term vector, and
the complete document collection becomes a vector space of dimension m, where
m is the number of word types in the collection. In the vector space model, a
document vector d_j is represented by a set of terms d_jk, 1 ≤ k ≤ m, where d_jk
represents the frequency (or weight) of term k in document j (i.e., the number of
times term k appears in document j). If d_jk = 0, term k does not appear in document
d_j. Queries, like documents, are represented by weighted term vectors.
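As a minimal illustration of this representation (a sketch with an invented toy vocabulary; `term_vector` is not part of the original paper), each document maps to a frequency vector over the collection's word types:

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Build a term vector d_j whose k-th component d_jk is the frequency
    of term k in the document (0 if the term does not appear)."""
    counts = Counter(text.lower().split())
    return [counts.get(term, 0) for term in vocabulary]

# A toy two-document collection; the vocabulary fixes the m dimensions.
vocab = ["retrieval", "thesaurus", "cluster", "query"]
d1 = term_vector("thesaurus cluster cluster retrieval", vocab)  # [1, 1, 2, 0]
d2 = term_vector("query retrieval retrieval", vocab)            # [2, 0, 0, 1]
```

A query is encoded the same way, so documents and queries live in the same m-dimensional space.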
`
Given any two term vectors, the similarity between the vectors may be
assumed to be inversely related to the angle between them. If the two vectors
coincide, the angle between them is zero, and the vectors are identical. In two
dimensions, the vector space may be represented by its envelope. The
(normalized) vectors are then viewed as points in the vector space, and the
distance between any two points is inversely related to the similarity of the
corresponding document vectors. The smaller the distance between two points, the
smaller the angle between the corresponding vectors, and the greater the similarity
of the vectors in terms of the number of word types they have in common.
`
Salton et al. [SALTON75a, SALTON75b, SALTON76] have shown that the
best document space for retrieval purposes is one which maximizes the average
separation between documents in the document space. In this space, it is easier to
distinguish between documents and thus easier to retrieve documents which are
most similar to a given query. The model which allows the terms in a collection to
be ranked in order of their effect on space density is called the discrimination value
model.
`
`The Discrimination Value Model
`
The discrimination value model [SALTON75a] assigns specific roles to single
`
`terms, term phrases, and term classes for content analysis purposes and provides a
`
`framework within which each potential index term in a collection can be ranked in
`accordance with its usefulness as a document discriminator.
`It also offers a
`
`reasonable physical interpretation of the indexing process.
`
If we consider a collection of documents, each represented by a set of
weighted m-dimensional vectors, then the similarity coefficient computed between
any two term vectors can be interpreted as a measure of the closeness or
relatedness between the vectors in m-space. If the similarity coefficient is large, the
documents are very similar and appear in close proximity to each other in the
document space. And if the similarity coefficient is small, the documents exhibit little
similarity and are widely separated in the document space.
`
`
`
`The discrimination value of a term is then defined as a measure of the change
`
`in space separation which occurs when a given term is assigned to the document
`
`collection. A good discriminator is a term which, when assigned to a document,
`
`decreases the space density (i.e., renders the documents less similar to each
`
`other). Conversely, the assignment of a poor discriminator increases the space
`
`density. By computing the density of the document space before and after the
`
`assignment of each term, the discrimination value of the term can be determined.
`
`The terms can then be ranked in decreasing order of their discrimination values.
`
Salton, Yang, and Yu [SALTON75a] have used discrimination value to
determine three categories of discriminators, namely, good, poor, and indifferent
discriminators. A term with a positive discrimination value has been found to be a
good discriminator. Salton et al. suggest that these terms be used directly as index
terms. Those terms with negative discrimination values are poor discriminators; the
retrieval properties of such terms can be transformed by including them in
appropriate phrases. The majority of terms are indifferent discriminators with near-
zero discrimination values. The retrieval capabilities of these terms can be
enhanced by means of their incorporation in appropriate thesaurus classes.
`
Thus the discrimination value model presents a criterion for the formation of
global thesauri. According to this model, a thesaurus is composed of a set of
thesaurus classes. A thesaurus class is composed of a group of terms or word
types. The terms within a class should be indifferent discriminators (i.e., those with
near-zero discrimination values). Thus in order to use the criterion suggested by
the discrimination model, the discrimination value of each term in the collection
must be calculated and the terms ranked as good, indifferent, and poor
discriminators according to their discrimination values.
`
`But the calculation of discrimination value is normally expensive. Two different
`
`approaches have been used. One approach, the so-called exact method, involves
`
`the calculation of all pairwise similarities between the document vectors of the
`
collection. For a collection of n documents and m word types, the complexity of this
algorithm is O(mn^2). The second or approximate approach to the calculation of
`
`discrimination value involves the construction of an artificial, average document, the
`
`centroid, and computes the sum of the similarities of each document with the
`
`centroid. The centroid algorithm is O(mn).
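The centroid method can be sketched as follows (an illustrative implementation, not the paper's code; the function names are ours, and this naive version recomputes the density for every term rather than updating it incrementally as the efficient variants do):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def density(docs):
    # Space density: sum of similarities of each document with the
    # centroid (the artificial "average document"); one pass, O(mn).
    m = len(docs[0])
    centroid = [sum(d[k] for d in docs) / len(docs) for k in range(m)]
    return sum(cosine(d, centroid) for d in docs)

def discrimination_values(docs):
    # DV_k = Q_k - Q, where Q is the density of the full space and Q_k
    # the density with term k removed. A positive DV marks a good
    # discriminator: deleting the term packs the documents closer together.
    q = density(docs)
    m = len(docs[0])
    return [density([d[:k] + d[k + 1:] for d in docs]) - q for k in range(m)]
```

Ranking the terms by decreasing DV then yields the good/indifferent/poor partition described above.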
`
Modifications have been suggested which improve the execution times
associated with both the exact and the approximate methods of calculating
discrimination value [CRAWFORD75, WILLET85, CROUCH88]. Although the
discrimination values produced by these two approaches differ significantly for a
particular collection, it has been shown that the rankings of the terms are in fact
highly compatible [CROUCH88]. Thus of these two methods, the more efficient
centroid approach is the obvious method of choice when discrimination values must
be calculated.
`
`
`
`
But for a document collection of any size, even the centroid approach to the
calculation of discrimination value may be expensive. A reasonable alternative has
been provided by Salton, Yang, and Yu, who suggest the use of document
frequency as an approximation to discrimination value. For any term k in a
document collection, the document frequency of term k, d_k, is defined as the
number of documents in which term k appears. Empirical results indicate that
document frequency and discrimination value are strongly correlated.

Let n represent the number of documents in a collection whose terms are ranked by
increasing document frequency. According to [SALTON75a], those terms whose
document frequency is less than n/100 may be considered low frequency terms.
The discrimination values of these terms are normally near-zero, and these terms
as a whole may be considered indifferent discriminators. Likewise, terms whose
document frequencies are greater than n/10 may be considered high frequency
terms. These terms normally have negative discrimination values and are
considered poor discriminators. The remaining terms (i.e., n/100 ≤ d_k ≤ n/10) make
up the set of good discriminators. The discrimination values of these terms are
positive.
`
Thus document frequency may be used as an approximation to discrimination
value. Thesaurus classes, which theoretically should consist of groupings of terms
with near-zero discrimination values, may instead be constructed of sets of low
frequency terms. Since document frequency is readily available for every term in a
collection, the cost associated with this approach is minimal.
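Because the cutoffs are just the two thresholds n/100 and n/10, the partition is a few lines of code (a sketch; the function name is ours):

```python
def classify_terms(doc_freq, n):
    """Partition terms by document frequency d_k, per [SALTON75a]:
    d_k < n/100  -> low  (indifferent discriminators; thesaurus candidates)
    d_k > n/10   -> high (poor discriminators)
    otherwise    -> mid  (good discriminators; used directly as index terms)
    doc_freq maps each term to the number of documents containing it;
    n is the number of documents in the collection."""
    low, mid, high = set(), set(), set()
    for term, dk in doc_freq.items():
        if dk < n / 100:
            low.add(term)
        elif dk > n / 10:
            high.add(term)
        else:
            mid.add(term)
    return low, mid, high
```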
`
`AN APPROACH TO THESAURUS CONSTRUCTION
`
An experiment was designed to investigate the feasibility of constructing a
global thesaurus based on low frequency terms. The term "global thesaurus" is
used to differentiate this type of thesaurus from the "local thesaurus" described by
Attar and Fraenkel [ATTAR77]. In a global approach, thesaurus classes, once
constructed, are used to index both documents and queries. The local thesaurus, in
contrast, uses information obtained from the documents retrieved in response to a
particular query to modify that query, which is then resubmitted to the retrieval
system for processing in lieu of the original. Thus a global thesaurus is constructed
prior to the indexing process and the thesaurus classes are used to index both
documents and queries, whereas a local thesaurus is constructed dynamically
during query processing and uses information retrieved in response to a specific
query to modify only that query.
`
`Constructing Thesaurus Classes
`
`Constructing a global thesaurus based on the discrimination value model calls
`
`for the generation of thesaurus classes consisting of indifferent discriminators or (as
`
a viable alternative) low frequency terms. The question of how the classes
`
`
`
`
themselves are to be constructed remains open. Intuitively, a thesaurus class
should consist of terms which are closely related in the context of the current
collection. (Such terms are not necessarily synonyms in the conventional sense.)

One approach to generating groups of closely related terms is to cluster all the
terms in the collection. The low frequency terms which cluster together might then
be considered a thesaurus class. Unfortunately, in an environment where there is
little information to exploit, the resultant clusters are seldom meaningful. This is the
case in the low frequency domain, where terms are contained in only a small
number of the documents in the collection.
`
An alternative approach is to cluster the documents of the collection and to
generate thesaurus classes from the low frequency terms contained in the
document clusters. A key question then arises, namely, what type of clustering
algorithm should be used? The choice is dictated largely by the type of clusters an
algorithm produces. In order to generate meaningful thesaurus classes, the low
frequency terms in a class should come from closely related documents. This
implies that the document clusters themselves should be small and tight. An
algorithm which produces clusters of this type is the complete-link clustering
algorithm, one of a class of agglomerative, hierarchical clustering algorithms that
has received some attention in the literature [VANRIJSB79, VOORHEES85,
VOORHEES86]. Consequently, this was the algorithm used to cluster the
documents in our test collections.
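The clustering step might be sketched as follows (a simplified, threshold-cut version of complete link, not the paper's implementation: it merges greedily and stops at a threshold rather than building the full hierarchy; `sim` is any document-similarity function, e.g. the cosine):

```python
def complete_link(docs, sim, threshold):
    """Greedy complete-link clustering. The similarity of two clusters is
    that of their LEAST similar cross pair, so only uniformly close
    documents merge -- yielding the small, tight clusters required here."""
    clusters = [[i] for i in range(len(docs))]
    while True:
        best, pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete-link similarity: minimum over all cross pairs
                s = min(sim(docs[i], docs[j])
                        for i in clusters[a] for j in clusters[b])
                if s >= best:
                    best, pair = s, (a, b)
        if pair is None:  # no pair of clusters meets the threshold
            return clusters
        a, b = pair
        clusters[a] += clusters.pop(b)

# Toy usage with scalar "documents" and an invented similarity:
groups = complete_link([0.0, 0.1, 1.0, 1.05],
                       lambda x, y: 1 - abs(x - y), 0.8)
```

Taking the minimum (rather than the maximum or average) over cross pairs is what makes the resulting clusters tight.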
`
`Constructing a Global Thesaurus
`
`The following procedure has been utilized to construct a global thesaurus:
`
`1. The document collection is clustered via the complete-link algorithm.
`
`2. The resultant hierarchy is traversed and thesaurus classes are generated,
`
`based on specified, user-supplied parameters.
`
`3. The documents and queries are indexed by the thesaurus classes.
`
The characteristics of the thesaurus classes generated in step 2 are
determined by the following user-supplied parameters:
`
(a) THRESHOLD VALUE
`
Application of the complete-link clustering algorithm produces a
hierarchy in which the tightest clusters (i.e., those which cluster at the
highest threshold values) lie at the bottom of the cluster tree. These nodes
are the leaves of the tree. For example, consider Fig. 1. The squares
represent documents and the numbers in the circles represent the levels at
which the documents cluster. Documents A and B cluster at a threshold
value of 0.089, D and E cluster at a level of 0.149, and document C clusters
with the D-E subtree at a level of 0.077. The A-B subtree and the C-D-E
subtree cluster at a threshold value of 0.029.

The user-supplied threshold value largely determines the documents
from which terms are selected for inclusion in a thesaurus class. In Fig. 1, a
threshold value of 0.090 would return only the D-E document cluster, since
`
`
Fig. 1 A sample hierarchy generated by the complete-link algorithm
`
`
the level at which the documents cluster must be greater than or equal to
the specified threshold in order for the cluster to be recognized. A
threshold value of 0.085, on the other hand, would return two clusters (A-B
and D-E). If a threshold value of 0.075 were specified, two clusters would
be returned: A-B and either D-E or C-D-E.
`
`(b) NUMBER OF DOCUMENTS IN A CLUSTER
`
Since both D-E and C-D-E meet the threshold value of 0.075 in Fig. 1,
another parameter is needed to determine the number of documents to be
included in the final cluster. This parameter, supplied by the user, is the
number of documents in a cluster. In the example, a value of 2 for this
parameter would ensure that only the D-E cluster would be returned. A
value of 3 or greater would allow the entire C-D-E cluster to be returned.
`
(c) MINIMUM DOCUMENT FREQUENCY
`
Once the document clusters have been selected, thesaurus classes
are formed from the low frequency terms contained in those clusters.
Before such terms can be selected, the minimum document frequency of
each term, i.e., the criterion by which terms are determined to be "low
frequency," must be specified. The guidelines in [SALTON75a] may be
helpful in setting this parameter.
`
(d) SPECIFICATION OF CLASS FORMATION METHOD
`
When the document clusters have been determined (via the settings of
the first two parameters) and the low frequency terms from the clustered
documents have been gathered, the thesaurus classes themselves can be
generated. The simplest approach is to define a thesaurus class as the
intersection of all the low frequency terms in a cluster. Alternate methods,
using (1) union and (2) restricted intersection (ranking the terms in the
intersection and using the top-ranked terms from that set) in lieu of the
intersection, were also implemented.

The most successful approach turned out to be the most
straightforward. The best results were obtained by using simple
intersection. The experimental results described in this paper are based on
this approach.
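The simple-intersection formation method reduces to a set intersection (a minimal sketch; the function name is ours):

```python
def thesaurus_class(cluster_docs, low_freq_terms):
    """A thesaurus class: the low-frequency terms common to every
    document in one tight cluster (simple-intersection formation).
    cluster_docs is a list of term sets, one per clustered document."""
    common = set.intersection(*(set(d) for d in cluster_docs))
    return common & set(low_freq_terms)

# Toy cluster of two documents; "alpha" is the only shared low-frequency term.
cls = thesaurus_class(
    [{"beta", "gamma", "alpha"}, {"alpha", "beta", "delta"}],
    {"alpha", "delta"})
```

Union would replace `set.intersection` with `set.union`; restricted intersection would additionally rank and truncate the intersected set.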
`
`
`
`
Once the thesaurus classes are generated, a collection can be indexed by
replacement or augmentation (see step 3 above). Replacement entails replacing a
term in a vector by a thesaurus class which contains that term. Augmentation
involves adding the thesaurus class to the vector; the original term remains in the
vector, although its weight may be modified. (For a discussion of term weighting
methods, see [SALTON87].) Both methods were implemented in this experiment.
The results obtained by augmentation are in every case vastly superior to those
obtained through replacement, and thus this discussion focuses on indexing by
augmentation.
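Indexing by augmentation can be sketched as follows (illustrative only; the class identifiers and weights are invented, and the paper's actual indexer is part of the Smart system):

```python
def augment(vector, thesaurus, class_weights):
    """Indexing by augmentation: every thesaurus class containing a vector
    term is added to the vector with its class weight; the original terms
    (and their weights) are kept."""
    out = dict(vector)
    for term in vector:
        for cid, members in thesaurus.items():
            if term in members:
                out.setdefault(cid, class_weights[cid])
    return out

# "nerve" belongs to the (hypothetical) class T1, so T1 joins the vector.
doc = augment({"nerve": 2, "cell": 1},
              {"T1": {"nerve", "axon"}},
              {"T1": 0.5})
```

Replacement would instead delete the matched term and keep only the class entry.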
`
Another factor of some interest during the indexing process is the weighting of
the new thesaurus classes. The classes themselves must be weighted so as to
ensure that they neither override the original vector terms nor have a negligible
effect during retrieval. An intuitively appealing approach is to compute the
thesaurus class weight as a function of the terms contained in that class. As a
result, the thesaurus class weights in this experiment were computed according to
the following formulas, where tc_wt denotes the thesaurus class weight, tc_con_wt_i
denotes the weight of term i in the thesaurus class, |tc_con| represents the number
of terms in the thesaurus class, and avg_tc_con_wt represents the average weight
of a term in the thesaurus class. Then

    avg_tc_con_wt = ( sum_{i=1}^{|tc_con|} tc_con_wt_i ) / |tc_con|

and

    tc_wt = ( avg_tc_con_wt / |tc_con| ) * 0.5

Thus a thesaurus class weight is computed by dividing the average weight of a term
in this class by the number of terms in the class and downweighting that value by
0.5. The experimental results indicate that this approach is indeed effective.
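The weight computation above is direct to express in code (a sketch; the function name is ours):

```python
def thesaurus_class_weight(term_weights):
    """tc_wt = (avg_tc_con_wt / |tc_con|) * 0.5: the average member-term
    weight, divided by the class size and downweighted by 0.5."""
    avg = sum(term_weights) / len(term_weights)
    return avg / len(term_weights) * 0.5

# A class of two terms weighted 2.0 and 4.0: avg 3.0 / 2 * 0.5 -> 0.75
w = thesaurus_class_weight([2.0, 4.0])
```

Dividing by the class size keeps large classes from dominating the augmented vector.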
`
`THE EXPERIMENT
`
`Methodology
`
Global thesauri were constructed for two test collections using the procedure
described in the previous section. To recap: the collections were clustered using
the complete-link algorithm. The hierarchies were traversed and thesaurus classes
were generated based on user-supplied parameters (threshold value, number of
documents in a cluster, minimum document frequency, and specification of class
formation method). The documents and queries were then indexed by the
thesaurus classes. In all cases, the indexing was performed by augmentation, and
the thesaurus class weights were generated based on the weighting scheme
detailed in the previous section. The weights of the original terms were not
modified.
`
`Setting the Parameters
`
With reference to the parameters used in step 2 of the thesaurus construction
procedure:

(a) Threshold value is collection-dependent. After a collection is clustered,
appropriate threshold values may be determined by examining the cluster tree. The
setting of the threshold value is the most vital factor in forming the thesaurus
classes. This setting in conjunction with the class formation method determines the
size of the thesaurus class. If the setting is too high, thesaurus classes will be
generated with either too many terms (using union for class formation) or with too
few terms (using intersection). If the threshold is too low, very few classes will be
generated.
`
(b) The number of documents in a cluster obviously also affects thesaurus
class size, but the overall effect of this parameter is considerably less than that of
the threshold value. An initial premise of this experiment is that terms in a
thesaurus class should come from closely related documents, which in turn implies
that the document clusters themselves must be small and tight. For investigative
purposes this parameter was allowed to range from 2 to 5 in our test runs, but it is
always subject to (i.e., governed by) the threshold value and in most cases, the
actual number of documents in a cluster ranges from 2 to 3.
`
(c) Minimum document frequency can be set by using the guidelines
suggested in [SALTON75a].
`
(d) Class formation method is the parameter which specifies exactly how the
thesaurus class itself is formed once the document cluster has been determined. Of
the three approaches investigated (union, intersection, and restricted intersection),
simple intersection gives the best results in all cases. Thus a thesaurus composed
of thesaurus classes consisting of sets of low frequency terms common to a small
group of closely related documents appears to have the greatest potential for
effecting an improvement in system performance.
`
`Collection Weighting
`
Once the global thesaurus has been constructed and the collection
(documents and queries) indexed by the thesaurus classes, the vectors are then
weighted appropriately, using facilities made available through the Smart
experimental retrieval system [BUCKLEY85]. (The original vectors, before
modification, are weighted by simple term frequency.) Smart offers a number of
options with respect to weighting. The most common approach is probably to
weight the vectors using "tf-idf" weights (i.e., term frequency multiplied by inverse
document frequency). This scheme is appealing because it allows a term to be
weighted by a factor dependent upon both its importance within a document (its
term frequency) and its importance within the document collection as a whole (its
document frequency). Moreover, tf-idf weighting has proved most effective in
comparison with other weighting schemes in numerous experiments.

The collection weighting method used in this experiment is a variation of the
tf-idf scheme, wherein the term frequency component is referred to as "augmented
normalized term frequency." In this method, the term frequency is normalized and
augmented and is thus guaranteed to lie between 0.5 and 1.0. This component
then replaces the tf component in the tf-idf scheme to generate the weight for this
term in the modified document collection.
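One common form of this weighting (a sketch only; the exact Smart formula may differ in its normalization and idf variant) is:

```python
import math

def augmented_tf_idf(tf, max_tf, n_docs, df):
    """Augmented normalized term frequency times inverse document
    frequency: atf = 0.5 + 0.5 * tf / max_tf lies in [0.5, 1.0] for any
    term that occurs, then replaces tf in the usual tf-idf product.
    max_tf is the largest term frequency in the document; df is the
    term's document frequency."""
    atf = 0.5 + 0.5 * (tf / max_tf)
    return atf * math.log(n_docs / df)
```

The 0.5 floor is what keeps an added thesaurus class or rare term from being weighted to insignificance.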
`
`Similarity Measure
`
A common measure of the similarity between two vectors d_j and d_k is the
cosine function

    cos(d_j, d_k) = (d_j · d_k) / (||d_j|| ||d_k||)

where d_j · d_k is the inner product and ||d_j||^2 = (d_j · d_j). This is the measure used to
compute similarity in this experiment.
`
`Collection Characteristics
`
The thesaurus construction procedure described in this paper was run on two
standard test collections, namely, ADI and Medlars. The characteristics of these
collections are found in Table I. (For more complete descriptions of these
collections, see [FOX83a, FOX83b, VOORHEES85].)

Table I
Collection Statistics

Collection   Number of Vectors   Number of Terms   Mean Number of Terms per Vector
ADI                  82                 822                     25.5
Medlars            1033                6927                     51.6
`
`
`
`
`Empirical Results
`
`The results of the experiment, as applied to the ADI and lv‘ledlars collections,
`
`are displayed in Table II. The normal evaluation measures, precision and recall,
`
`are used.
`
`(Recall is defined as the proportion of relevant material retrieved, while
`
`precision is the proportion of retrieved material that is relevant.) Part (a) gives the
`
`statistics in terms of precision and recall when the thesaurus generation procedure
`
`is applied to the ADl collection (using in this case the following parameters: a
`
`threshold level of 0.075, 5 documents per cluster, a minimum document frequency
`
`of 20. and intersection as the class formation method). The statistics which apply to
`
`the original, unmodified ADI collection are found in column 1. Column 2 pertains to
`
`the modified ADI collection after application of the global thesaurus. The average
`
`precision at 3 points of recall (0.25, 0.50, and 0.75) is calculated for each collection
`
`and shows an overall improvement of 10.6 per cent when the global thesaurus is
`
`appfied.
`
`Part (D) gives a similar view when the procedure is applied to the Medlars
`
`collection (with corresponding parameter settings of 0.120, 3, 50, and intersection).
`
`Again, column 1 applies to the original Medlars collection and column 2 applies to
`
`same collection after application of the global thesaurus. The improvement in
`
`precision (based on the 3 point average) of the modified collection over the original
`
`is 15.8 per cent.
`
SUMMARY

An experiment was designed to investigate the feasibility of constructing a
global thesaurus based on the discrimination value model and an appropriate
clustering algorithm. A thesaurus construction procedure was designed,
implemented, and applied to two standard test collections. The preliminary results
show that substantial improvements can be effected by making a properly
constructed global thesaurus available to a retrieval system during processing.
Moreover, as the empirical results indicate, it is possible to construct such a
thesaurus automatically. The question of whether it is possible to construct such a
thesaurus for other collections remains to be answered.
`
`
`
`
TABLE II
A Comparison of Retrieval Statistics:
The Original versus the Thesaurus-Based Collection

[For each collection, the table lists precision at recall levels 0.00 through 1.00
and the standard Smart evaluation statistics (norm recall, norm precision, rank
recall, log precision, precision and recall after 10 and 30 documents, and
E-measures), with column 1 for the original collection and column 2 for the
thesaurus-based collection. The individual entries are not reliably recoverable
from this copy; the summary figures are:]

(a) ADI (threshold .075, no. docs 5): average precision for 3 intermediate
    points = 0.4550 (original); % precision change = 10.6; number of queries = 35.

(b) Medlars (threshold .120, no. docs 3): average precision for 3 intermediate
    points = 0.5515 (original); % precision change = 15.8; number of queries = 30.
`
`
`
`
REFERENCES

[ATTAR77]     Attar, R., and A.S. Fraenkel, Local Feedback in Full-text Retrieval
              Systems, Journal of the ACM, 24(3):397-417, 1977.

[BUCKLEY85]   Buckley, C., Implementation of the Smart Information Retrieval
              System, Tech. Report 85-686, Dept. of Computer Science,
              Cornell University, May 1985.

[CRAWFORD75]  Crawford, R., The Computation of Discrimination Values,
              Information Processing and Management, 11:249-253, 1975.

[CROUCH88]    Crouch, C., An Analysis of Approximate Versus Exact
              Discrimination Values, Information Processing and Management,
              24(1):5-16, 1988.

[FOX83a]      Fox, E., Extending the Boolean and Vector Space Models of
              Information Retrieval with P-norm Queries and Multiple Concept
              Types, Ph.D. Thesis, Cornell University, Aug. 1983.

[FOX83b]      Fox, E., Characteristics of Two New Experimental Collections in
              Computer and Information Science Containing Textual and
              Bibliographic Concepts, Tech. Report 83-561, Dept. of Computer
              Science, Cornell University, Sept. 1983.

[SALTON75a]   Salton, G., C.S. Yang, and C.T. Yu, A Theory of Term Importance
              in Automatic Text Analysis, Journal of the ASIS, 26(1):33-44,
              1975.

[SALTON75b]   Salton, G., A. Wong, and C.S. Yang, A Vector Space Model for
              Automatic Indexing, Communications of the ACM, 18:613-620,
              1975.

[SALTON76]    Salton, G., and A. Wong, On the Role of Words and Phrases in
              Automatic Text Analysis, Computers and the Humanities, 10:69-87,
              1976.

[SALTON87]    Salton, G., and C. Buckley, Term Weighting Approaches in
              Automatic Text Retrieval, Tech. Report 87-881, Dept. of Computer
              Science, Cornell University, Nov. 1987.

[VANRIJSB79]  van Rijsbergen, C.J., Information Retrieval, 2nd edition,
              Butterworths, London, England, 1979.

[VOORHEES85]  Voorhees, E., The Effectiveness and Efficiency of Agglomerative
              Hierarchic Clustering in Document Retrieval, Tech. Report 85-705,
              Dept. of Computer Science, Cornell University, Oct. 1985.

[VOORHEES86]  Voorhees, E., Implementing Agglomerative Hierarchic Clustering
              Algorithms for Use in Document Retrieval, Tech. Report 86-765,
              Dept. of Computer Science, Cornell University, July 1986.

[WILLET85]    Willett, P., An Algorithm for the Calculation of Exact Term
              Discrimination Values, Information Processing and
              Management, 21(3):225-232, 1985.
`