A CLUSTER-BASED APPROACH TO THESAURUS CONSTRUCTION

Carolyn J. Crouch
Department of Computer Science
Tulane University
New Orleans, LA 70118

The importance of a thesaurus in the successful operation of an information retrieval system is well recognized. Yet techniques which support the automatic generation of thesauri remain largely undiscovered. This paper describes one approach to the automatic generation of global thesauri, based on the discrimination value model of Salton, Yang, and Yu and on an appropriate clustering algorithm. This method has been implemented and applied to two document collections. Preliminary results indicate that this method, which produces improvements in retrieval performance in excess of 10 and 15 percent in the test collections, is viable and worthy of continued investigation.

INTRODUCTION
A major factor in the successful operation of any information retrieval system is the set of dictionaries available for use by the system. Of these dictionaries (e.g., thesauri, statistical and/or syntactic phrase dictionaries, term hierarchies, etc.), the dictionary having the greatest potential impact on system performance is undoubtedly the thesaurus. Although the benefits accruing from the use of a well-constructed thesaurus in terms of increased system performance are well recognized, the methodology for automatically creating such a thesaurus remains unspecified. In fact, virtually all thesauri presently in use are idiosyncratic.

Thus a topic of considerable interest to researchers aiming to improve the overall performance of information retrieval systems is automatic thesaurus construction. This paper describes an approach to the automatic construction of a global thesaurus based on the discrimination value model of Salton, Yang, and Yu [SALTON75a] and on an appropriate clustering algorithm. The discrimination value model itself is based on the vector space model described below.
The Vector Space Model

One of the major models in information retrieval is the vector space model. This model views each document in the document collection as a set of unique words or word types. Each document may then be regarded as a term vector, and the complete document collection becomes a vector space of dimension m, where m is the number of word types in the collection. In the vector space model, a document vector dj is represented by a set of terms djk, 1 ≤ k ≤ m, where djk represents the frequency (or weight) of term k in document j (i.e., the number of times term k appears in document j). If djk = 0, term k does not appear in document dj. Queries, like documents, are represented by weighted term vectors.
Given any two term vectors, the similarity between the vectors may be assumed to be inversely related to the angle between them. If the two vectors coincide, the angle between them is zero, and the vectors are identical. In two dimensions, the vector space may be represented by its envelope. The (normalized) vectors are then viewed as points in the vector space, and the distance between any two points is inversely related to the similarity of the corresponding document vectors. The smaller the distance between two points, the smaller the angle between the corresponding vectors, and the greater the similarity of the vectors in terms of the number of word types they have in common.
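To make the representation concrete, the following minimal sketch (illustrative only; names are ours) builds such term vectors from a toy collection, taking djk as the raw within-document frequency of term k:

```python
from collections import Counter

def build_term_vectors(documents):
    """Represent each document as a term vector d_j of term frequencies d_jk."""
    # The m unique word types in the collection define the dimensions of the space.
    vocabulary = sorted({term for doc in documents for term in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        # d_jk = number of times term k appears in document j (0 if absent).
        vectors.append([counts.get(term, 0) for term in vocabulary])
    return vocabulary, vectors

if __name__ == "__main__":
    docs = ["automatic thesaurus construction for retrieval",
            "term weighting in automatic indexing"]
    vocab, vecs = build_term_vectors(docs)
    print(vocab)   # the m word types
    print(vecs)    # one m-dimensional frequency vector per document
```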
Salton et al. [SALTON75a, SALTON75b, SALTON76] have shown that the best document space for retrieval purposes is one which maximizes the average separation between documents in the document space. In this space, it is easier to distinguish between documents and thus easier to retrieve documents which are most similar to a given query. The model which allows the terms in a collection to be ranked in order of their effect on space density is called the discrimination value model.
The Discrimination Value Model

The discrimination value model [SALTON75a] assigns specific roles to single terms, term phrases, and term classes for content analysis purposes and provides a framework within which each potential index term in a collection can be ranked in accordance with its usefulness as a document discriminator. It also offers a reasonable physical interpretation of the indexing process.
If we consider a collection of documents, each represented by a set of weighted m-dimensional vectors, then the similarity coefficient computed between any two term vectors can be interpreted as a measure of the closeness or relatedness between the vectors in m-space. If the similarity coefficient is large, the documents are very similar and appear in close proximity to each other in the document space. And if the similarity coefficient is small, the documents exhibit little similarity and are widely separated in the document space.
The discrimination value of a term is then defined as a measure of the change in space separation which occurs when a given term is assigned to the document collection. A good discriminator is a term which, when assigned to a document, decreases the space density (i.e., renders the documents less similar to each other). Conversely, the assignment of a poor discriminator increases the space density. By computing the density of the document space before and after the assignment of each term, the discrimination value of the term can be determined. The terms can then be ranked in decreasing order of their discrimination values.

Salton, Yang, and Yu [SALTON75a] have used discrimination value to determine three categories of discriminators, namely, good, poor, and indifferent discriminators. A term with a positive discrimination value has been found to be a good discriminator. Salton et al. suggest that these terms be used directly as index terms. Those terms with negative discrimination values are poor discriminators; the retrieval properties of such terms can be transformed by including them in appropriate phrases. The majority of terms are indifferent discriminators with near-zero discrimination values. The retrieval capabilities of these terms can be enhanced by means of their incorporation in appropriate thesaurus classes.
Thus the discrimination value model presents a criterion for the formation of global thesauri. According to this model, a thesaurus is composed of a set of thesaurus classes. A thesaurus class is composed of a group of terms or word types. The terms within a class should be indifferent discriminators (i.e., those with near-zero discrimination values). Thus in order to use the criterion suggested by the discrimination model, the discrimination value of each term in the collection must be calculated and the terms ranked as good, indifferent, and poor discriminators according to their discrimination values.
But the calculation of discrimination value is normally expensive. Two different approaches have been used. One approach, the so-called exact method, involves the calculation of all pairwise similarities between the document vectors of the collection. For a collection of n documents and m word types, the complexity of this algorithm is O(mn²). The second or approximate approach to the calculation of discrimination value involves the construction of an artificial, average document, the centroid, and computes the sum of the similarities of each document with the centroid. The centroid algorithm is O(mn).
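To illustrate the centroid-based definition, here is a sketch under our own naming, using cosine as the similarity measure; it is a direct, unoptimized evaluation rather than the incremental O(mn) algorithm:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def space_density(vectors):
    # Density = average similarity of each document to the collection centroid.
    n = len(vectors)
    centroid = [sum(col) / n for col in zip(*vectors)]
    return sum(cosine(d, centroid) for d in vectors) / n

def discrimination_values(vectors):
    """DV_k = (density with term k removed) - (density of the full space).

    Positive values mark good discriminators, near-zero values indifferent ones,
    and negative values poor ones.  This recomputes the density from scratch for
    each term; the O(mn) centroid algorithm avoids that by updating incrementally.
    """
    base = space_density(vectors)
    m = len(vectors[0])
    return [space_density([d[:k] + d[k + 1:] for d in vectors]) - base
            for k in range(m)]
```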
Modifications have been suggested which improve the execution times associated with both the exact and the approximate methods of calculating discrimination value [CRAWFORD75, WILLET85, CROUCH88]. Although the discrimination values produced by these two approaches differ significantly for a particular collection, it has been shown that the rankings of the terms are in fact highly compatible [CROUCH88]. Thus of these two methods, the more efficient centroid approach is the obvious method of choice when discrimination values must be calculated.
But for a document collection of any size, even the centroid approach to the calculation of discrimination value may be expensive. A reasonable alternative has been provided by Salton, Yang, and Yu, who suggest the use of document frequency as an approximation to discrimination value. For any term k in a document collection, the document frequency of term k, dk, is defined as the number of documents in which term k appears. Empirical results indicate that document frequency and discrimination value are strongly correlated. Let n represent the number of documents in a collection whose terms are ranked by increasing document frequency. According to [SALTON75a], those terms whose document frequency is less than n/100 may be considered low frequency terms. The discrimination values of these terms are normally near-zero, and these terms as a whole may be considered indifferent discriminators. Likewise, terms whose document frequencies are greater than n/10 may be considered high frequency terms. These terms normally have negative discrimination values and are considered poor discriminators. The remaining terms (i.e., n/100 ≤ dk ≤ n/10) make up the set of good discriminators. The discrimination values of these terms are positive.
Thus document frequency may be used as an approximation to discrimination value. Thesaurus classes, which theoretically should consist of groupings of terms with near-zero discrimination values, may instead be constructed of sets of low frequency terms. Since document frequency is readily available for every term in a collection, the cost associated with this approach is minimal.
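A small illustrative sketch of the document-frequency partition just described, using the n/100 and n/10 cutoffs suggested in [SALTON75a]:

```python
def partition_by_document_frequency(doc_freq, n_docs):
    """Split terms into low-, mid-, and high-frequency bands.

    doc_freq: dict mapping term -> number of documents containing the term (dk).
    Low-frequency terms (dk < n/100) stand in for the indifferent discriminators,
    mid-frequency terms (n/100 <= dk <= n/10) for the good discriminators, and
    high-frequency terms (dk > n/10) for the poor discriminators.
    """
    low, mid, high = [], [], []
    for term, dk in doc_freq.items():
        if dk < n_docs / 100:
            low.append(term)
        elif dk > n_docs / 10:
            high.append(term)
        else:
            mid.append(term)
    return low, mid, high
```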
AN APPROACH TO THESAURUS CONSTRUCTION

An experiment was designed to investigate the feasibility of constructing a global thesaurus based on low frequency terms. The term "global thesaurus" is used to differentiate this type of thesaurus from the "local thesaurus" described by Attar and Fraenkel [ATTAR77]. In a global approach, thesaurus classes, once constructed, are used to index both documents and queries. The local thesaurus, in contrast, uses information obtained from the documents retrieved in response to a particular query to modify that query, which is then resubmitted to the retrieval system for processing in lieu of the original. Thus a global thesaurus is constructed prior to the indexing process and the thesaurus classes are used to index both documents and queries, whereas a local thesaurus is constructed dynamically during query processing and uses information retrieved in response to a specific query to modify only that query.
Constructing Thesaurus Classes

Constructing a global thesaurus based on the discrimination value model calls for the generation of thesaurus classes consisting of indifferent discriminators or (as a viable alternative) low frequency terms. The question of how the classes themselves are to be constructed remains open.
Intuitively, a thesaurus class should consist of terms which are closely related in the context of the current collection. (Such terms are not necessarily synonyms in the conventional sense.)

One approach to generating groups of closely related terms is to cluster all the terms in the collection. The low frequency terms which cluster together might then be considered a thesaurus class. Unfortunately, in an environment where there is little information to exploit, the resultant clusters are seldom meaningful. This is the case in the low frequency domain, where terms are contained in only a small number of the documents in the collection.
An alternative approach is to cluster the documents of the collection and to generate thesaurus classes from the low frequency terms contained in the document clusters. A key question then arises, namely, what type of clustering algorithm should be used? The choice is dictated largely by the type of clusters an algorithm produces. In order to generate meaningful thesaurus classes, the low frequency terms in a class should come from closely related documents. This implies that the document clusters themselves should be small and tight. An algorithm which produces clusters of this type is the complete-link clustering algorithm, one of a class of agglomerative, hierarchical clustering algorithms that has received some attention in the literature [VANRIJSB79, VOORHEES85, VOORHEES86]. Consequently, this was the algorithm used to cluster the documents in our test collections.
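For illustration, a minimal complete-link clusterer over a precomputed document similarity matrix (a sketch, not the implementation used in the experiment; it returns the flat clusters obtained at a given threshold rather than the full hierarchy):

```python
def complete_link_clusters(sim, threshold):
    """Agglomerative complete-link clustering of documents.

    sim: symmetric n x n matrix of pairwise document similarities.
    threshold: two clusters are merged only while their complete-link similarity
               (the minimum similarity over all cross-pairs) is at least this value,
               which is what keeps the resulting clusters small and tight.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, best_pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: the similarity of the *least* similar cross-pair.
                link = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if link >= best:
                    best, best_pair = link, (a, b)
        if best_pair is None:      # no pair of clusters meets the threshold
            break
        a, b = best_pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```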
Constructing a Global Thesaurus

The following procedure has been utilized to construct a global thesaurus:

1. The document collection is clustered via the complete-link algorithm.

2. The resultant hierarchy is traversed and thesaurus classes are generated, based on specified, user-supplied parameters.

3. The documents and queries are indexed by the thesaurus classes.

The characteristics of the thesaurus classes generated in step 2 are determined by the following user-supplied parameters:
(a) THRESHOLD VALUE

Application of the complete-link clustering algorithm produces a hierarchy in which the tightest clusters (i.e., those which cluster at the highest threshold values) lie at the bottom of the cluster tree. These nodes are the leaves of the tree. For example, consider Fig. 1. The squares represent documents and the numbers in the circles represent the levels at which the documents cluster. Documents A and B cluster at a threshold value of 0.089, D and E cluster at a level of 0.149, and document C clusters with the D-E subtree at a level of 0.077. The A-B subtree and the C-D-E subtree cluster at a threshold value of 0.029.

[Fig. 1. A sample hierarchy generated by the complete-link algorithm]

The user-supplied threshold value largely determines the documents from which terms are selected for inclusion in a thesaurus class. In Fig. 1, a threshold value of 0.090 would return only the D-E document cluster, since the level at which the documents cluster must be greater than or equal to the specified threshold in order for the cluster to be recognized. A threshold value of 0.085, on the other hand, would return two clusters (A-B and D-E). If a threshold value of 0.075 were specified, two clusters would be returned: A-B and either D-E or C-D-E.
(b) NUMBER OF DOCUMENTS IN A CLUSTER

Since both D-E and C-D-E meet the threshold value of 0.075 in Fig. 1, another parameter is needed to determine the number of documents to be included in the final cluster. This parameter, supplied by the user, is the number of documents in a cluster. In the example, a value of 2 for this parameter would ensure that only the D-E cluster would be returned. A value of 3 or greater would allow the entire C-D-E cluster to be returned.
(c) MINIMUM DOCUMENT FREQUENCY

Once the document clusters have been selected, thesaurus classes are formed from the low frequency terms contained in those clusters. Before such terms can be selected, the minimum document frequency of each term, i.e., the criterion by which terms are determined to be "low frequency," must be specified. The guidelines in [SALTON75a] may be helpful in setting this parameter.
(d) SPECIFICATION OF CLASS FORMATION METHOD

When the document clusters have been determined (via the settings of the first two parameters) and the low frequency terms from the clustered documents have been gathered, the thesaurus classes themselves can be generated. The simplest approach is to define a thesaurus class as the intersection of all the low frequency terms in a cluster. Alternate methods, using (1) union and (2) restricted intersection (ranking the terms in the intersection and using the top ranked terms from that set) in lieu of the intersection, were also implemented.

The most successful approach turned out to be the most straightforward. The best results were obtained by using simple intersection. The experimental results described in this paper are based on this approach.
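Combining the four parameters, a hypothetical sketch of step 2 of the procedure: form one thesaurus class per qualifying document cluster as the intersection of the clustered documents' low frequency terms. All names, and the exact convention for "low frequency," are illustrative assumptions.

```python
def build_thesaurus_classes(clusters, doc_terms, doc_freq,
                            max_cluster_size, min_doc_freq):
    """Form thesaurus classes from document clusters by simple intersection.

    clusters:         document clusters returned at the chosen threshold (parameter a).
    doc_terms:        dict mapping document id -> set of terms in that document.
    doc_freq:         dict mapping term -> document frequency.
    max_cluster_size: parameter (b), the number of documents allowed in a cluster.
    min_doc_freq:     parameter (c); here a term counts as "low frequency" when its
                      document frequency falls below this cutoff (an assumption).
    """
    classes = []
    for cluster in clusters:
        if not 2 <= len(cluster) <= max_cluster_size:
            continue
        # Keep only the low frequency terms of each clustered document ...
        low_freq_terms = [{t for t in doc_terms[d] if doc_freq[t] < min_doc_freq}
                          for d in cluster]
        # ... and take their intersection (parameter d: simple intersection).
        thesaurus_class = set.intersection(*low_freq_terms)
        if thesaurus_class:
            classes.append(thesaurus_class)
    return classes
```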
Once the thesaurus classes are generated, a collection can be indexed by replacement or augmentation (see step 3 above). Replacement entails replacing a term in a vector by a thesaurus class which contains that term. Augmentation involves adding the thesaurus class to the vector; the original term remains in the vector, although its weight may be modified. (For a discussion of term weighting methods, see [SALTON87].) Both methods were implemented in this experiment. The results obtained by augmentation are in every case vastly superior to those obtained through replacement, and thus this discussion focuses on indexing by augmentation.
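A brief sketch of the two indexing options, under our own naming; the class weight used here is the tc_wt defined by the formula that follows:

```python
def index_by_augmentation(vector, term_to_class, class_weight):
    """Augmentation: keep every original term and add its thesaurus class."""
    augmented = dict(vector)                     # vector: dict term -> weight
    for term in vector:
        cls = term_to_class.get(term)
        if cls is not None:
            augmented["CLASS_" + str(cls)] = class_weight[cls]
    return augmented

def index_by_replacement(vector, term_to_class, class_weight):
    """Replacement: substitute the thesaurus class for each term it contains."""
    replaced = {}
    for term, weight in vector.items():
        cls = term_to_class.get(term)
        if cls is None:
            replaced[term] = weight
        else:
            replaced["CLASS_" + str(cls)] = class_weight[cls]
    return replaced
```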
Another factor of some interest during the indexing process is the weighting of the new thesaurus classes. The classes themselves must be weighted so as to ensure that they neither override the original vector terms nor have a negligible effect during retrieval. An intuitively appealing approach is to compute the thesaurus class weight as a function of the terms contained in that class. As a result, the thesaurus class weights in this experiment were computed according to the following formula, where tc_wt denotes the thesaurus class weight, tc_con_wt_i denotes the weight of term i in this thesaurus class, |tc_con| represents the number of terms in the thesaurus class, and avg_tc_con_wt represents the average weight of a term in the thesaurus class. Then

    avg_tc_con_wt = ( Σ_{i=1}^{|tc_con|} tc_con_wt_i ) / |tc_con|

and

    tc_wt = ( avg_tc_con_wt / |tc_con| ) * 0.5

Thus a thesaurus class weight is computed by dividing the average weight of a term in this class by the number of terms in the class and downweighting that value by 0.5. The experimental results indicate that this approach is indeed effective.
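Read as code, the weight computation is simply the following (a sketch; term_weights are the weights of the terms making up the class):

```python
def thesaurus_class_weight(term_weights):
    """tc_wt = (average term weight in the class / number of terms in the class) * 0.5"""
    size = len(term_weights)              # |tc_con|
    avg = sum(term_weights) / size        # avg_tc_con_wt
    return (avg / size) * 0.5             # tc_wt
```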
THE EXPERIMENT

Methodology

Global thesauri were constructed for two test collections using the procedure described in the previous section. To recap: the collections were clustered using the complete-link algorithm. The hierarchies were traversed and thesaurus classes were generated based on user-supplied parameters (threshold value, number of documents in a cluster, minimum document frequency, and specification of class formation method). The documents and queries were then indexed by the thesaurus classes. In all cases, the indexing was performed by augmentation, and the thesaurus class weights were generated based on the weighting scheme
detailed in the previous section. The weights of the original terms were not modified.
Setting the Parameters

With reference to the parameters used in step 2 of the thesaurus construction procedure:

(a) Threshold value is collection-dependent. After a collection is clustered, appropriate threshold values may be determined by examining the cluster tree. The setting of the threshold value is the most vital factor in forming the thesaurus classes. This setting in conjunction with the class formation method determines the size of the thesaurus class. If the setting is too high, thesaurus classes will be generated with either too many terms (using union for class formation) or with too few terms (using intersection). If the threshold is too low, very few classes will be generated.

(b) The number of documents in a cluster obviously also affects thesaurus class size, but the overall effect of this parameter is considerably less than that of the threshold value. An initial premise of this experiment is that terms in a thesaurus class should come from closely related documents, which in turn implies that the document clusters themselves must be small and tight. For investigative purposes this parameter was allowed to range from 2 to 5 in our test runs, but it is always subject to (i.e., governed by) the threshold value, and in most cases the actual number of documents in a cluster ranges from 2 to 3.
(c) Minimum document frequency can be set by using the guidelines suggested in [SALTON75a].

(d) Class formation method is the parameter which specifies exactly how the thesaurus class itself is formed once the document cluster has been determined. Of the three approaches investigated (union, intersection, and restricted intersection), simple intersection gives the best results in all cases. Thus a thesaurus composed of thesaurus classes consisting of sets of low frequency terms common to a small group of closely related documents appears to have the greatest potential for effecting an improvement in system performance.
Collection Weighting

Once the global thesaurus has been constructed and the collection (documents and queries) indexed by the thesaurus classes, the vectors are then weighted appropriately, using facilities made available through the Smart experimental retrieval system [BUCKLEY85]. (The original vectors, before modification, are weighted by simple term frequency.) Smart offers a number of options with respect to weighting. The most common approach is probably to weight the vectors using "tf-idf" weights (i.e., term frequency multiplied by inverse document frequency). This scheme is appealing because it allows a term to be weighted by a factor dependent upon both its importance within a document (its
term frequency) and its importance within the document collection as a whole (its document frequency). Moreover, tf-idf weighting has proved most effective in comparison with other weighting schemes in numerous experiments.

The collection weighting method used in this experiment is a variation of the tf-idf scheme, wherein the term frequency component is referred to as "augmented normalized term frequency." In this method, the term frequency is normalized and augmented and is thus guaranteed to lie between 0.5 and 1.0. This component then replaces the tf component in the tf-idf scheme to generate the weight for this term in the modified document collection.
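The paper does not spell the components out; the usual Smart form of the augmented normalized term frequency is 0.5 + 0.5 * tf / max_tf, which lies between 0.5 and 1.0 for any term that occurs in the document. Assuming that form and a standard logarithmic inverse document frequency, the weight would be computed roughly as follows:

```python
import math

def augmented_ntf_idf(tf, max_tf_in_doc, doc_freq, n_docs):
    """A sketch of the weighting variant described above (exact forms assumed).

    augmented normalized tf = 0.5 + 0.5 * (tf / maximum tf in the document)
    idf                     = log(number of documents / document frequency)
    """
    antf = 0.5 + 0.5 * (tf / max_tf_in_doc)
    idf = math.log(n_docs / doc_freq)
    return antf * idf
```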
Similarity Measure

A common measure of the similarity between two vectors dj and dk is the cosine function

    cos(dj, dk) = (dj · dk) / (||dj|| ||dk||)

where dj · dk is the inner product and ||dj||² = dj · dj. This is the measure used to compute similarity in this experiment.
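For example, with the illustrative vectors dj = (1, 2, 0) and dk = (2, 1, 1), dj · dk = 4, ||dj|| = √5, ||dk|| = √6, and cos(dj, dk) = 4 / (√5 · √6) ≈ 0.73.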
Collection Characteristics

The thesaurus construction procedure described in this paper was run on two standard test collections, namely, ADI and Medlars. The characteristics of these collections are found in Table I. (For more complete descriptions of these collections, see [FOX83a, FOX83b, VOORHEES85].)
Table I
Collection Statistics

Collection    Number of Vectors    Number of Terms    Mean Number of Terms per Vector
ADI           82                   822                25.5
Medlars       1033                 6927               51.6
Empirical Results

The results of the experiment, as applied to the ADI and Medlars collections, are displayed in Table II. The normal evaluation measures, precision and recall, are used. (Recall is defined as the proportion of relevant material retrieved, while precision is the proportion of retrieved material that is relevant.) Part (a) gives the statistics in terms of precision and recall when the thesaurus generation procedure is applied to the ADI collection (using in this case the following parameters: a threshold level of 0.075, 5 documents per cluster, a minimum document frequency of 20, and intersection as the class formation method). The statistics which apply to the original, unmodified ADI collection are found in column 1. Column 2 pertains to the modified ADI collection after application of the global thesaurus. The average precision at 3 points of recall (0.25, 0.50, and 0.75) is calculated for each collection and shows an overall improvement of 10.6 per cent when the global thesaurus is applied.
Part (b) gives a similar view when the procedure is applied to the Medlars collection (with corresponding parameter settings of 0.120, 3, 50, and intersection). Again, column 1 applies to the original Medlars collection and column 2 applies to the same collection after application of the global thesaurus. The improvement in precision (based on the 3-point average) of the modified collection over the original is 15.8 per cent.
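For clarity, the 3-point figure is just the mean precision at recall 0.25, 0.50, and 0.75, compared between the two runs; a sketch with made-up numbers (not the Table II values):

```python
def three_point_average(precision_at_recall):
    """Average precision at the recall levels 0.25, 0.50, and 0.75."""
    return sum(precision_at_recall[r] for r in (0.25, 0.50, 0.75)) / 3

def percent_improvement(original, modified):
    return 100.0 * (modified - original) / original

if __name__ == "__main__":
    # Illustrative values only.
    original  = {0.25: 0.55, 0.50: 0.47, 0.75: 0.35}
    thesaurus = {0.25: 0.60, 0.50: 0.52, 0.75: 0.39}
    print(percent_improvement(three_point_average(original),
                              three_point_average(thesaurus)))
```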
SUMMARY

An experiment was designed to investigate the feasibility of constructing a global thesaurus based on the discrimination value model and an appropriate clustering algorithm. A thesaurus construction procedure was designed, implemented, and applied to two standard test collections. The preliminary results show that substantial improvements can be effected by making a properly constructed global thesaurus available to a retrieval system during processing. Moreover, as the empirical results indicate, it is possible to construct such a thesaurus automatically. The question of whether it is possible to construct such a thesaurus for other collections remains to be answered.
TABLE II
A Comparison of Retrieval Statistics: The Original versus the Thesaurus-Based Collection

(a) ADI (threshold 0.075, 5 documents per cluster): precision at recall levels 0.00 through 1.00 and summary statistics (normalized recall and precision, rank recall, log precision, precision and recall after 10 and 30 documents, and E-measures) for the original collection (column 1) and the thesaurus-based collection (column 2), over 35 queries. The 3-point average precision improves by 10.6 per cent.

(b) Medlars (threshold 0.120, 3 documents per cluster): the corresponding statistics over 30 queries. The 3-point average precision improves by 15.8 per cent.

[The individual tabulated values are not legible in this copy and are omitted here.]
REFERENCES

[ATTAR77]      Attar, R., and A.S. Fraenkel, Local Feedback in Full-text Retrieval Systems, Journal of the ACM, 24(3):397-417, 1977.

[BUCKLEY85]    Buckley, C., Implementation of the Smart Information Retrieval System, Tech. Report 85-686, Dept. of Computer Science, Cornell University, May 1985.

[CRAWFORD75]   Crawford, R., The Computation of Discrimination Values, Information Processing and Management, 11:249-253, 1975.

[CROUCH88]     Crouch, C., An Analysis of Approximate Versus Exact Discrimination Values, Information Processing and Management, 24(1):5-16, 1988.

[FOX83a]       Fox, E., Extending the Boolean and Vector Space Models of Information Retrieval with P-norm Queries and Multiple Concept Types, Ph.D. Thesis, Cornell University, Aug. 1983.

[FOX83b]       Fox, E., Characteristics of Two New Experimental Collections in Computer and Information Science Containing Textual and Bibliographic Concepts, Tech. Report 83-561, Dept. of Computer Science, Cornell University, Sept. 1983.

[SALTON75a]    Salton, G., C.S. Yang, and C.T. Yu, A Theory of Term Importance in Automatic Text Analysis, Journal of the ASIS, 26(1):33-44, 1975.

[SALTON75b]    Salton, G., A. Wong, and C.S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18:613-620, 1975.

[SALTON76]     Salton, G., and A. Wong, On the Role of Words and Phrases in Automatic Text Analysis, Computers and the Humanities, 10:69-87, 1976.

[SALTON87]     Salton, G., and C. Buckley, Term Weighting Approaches in Automatic Text Retrieval, Tech. Report 87-881, Dept. of Computer Science, Cornell University, Nov. 1987.

[VANRIJSB79]   van Rijsbergen, C.J., Information Retrieval, 2nd edition, Butterworths, London, England, 1979.

[VOORHEES85]   Voorhees, E., The Effectiveness and Efficiency of Agglomerative Hierarchic Clustering in Document Retrieval, Tech. Report 85-705, Dept. of Computer Science, Cornell University, Oct. 1985.

[VOORHEES86]   Voorhees, E., Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval, Tech. Report 86-765, Dept. of Computer Science, Cornell University, July 1986.

[WILLET85]     Willett, P., An Algorithm for the Calculation of Exact Term Discrimination Values, Information Processing and Management, 21(3):225-232, 1985.