A Cooccurrence-Based Thesaurus and Two Applications to Information Retrieval

Hinrich Schütze and Jan O. Pedersen

Xerox Palo Alto Research Center
3333 Coyote Hill Road, Palo Alto, CA 94304, USA
{schuetze,pedersen}@parc.xerox.com
Abstract

This paper presents a new method for computing a thesaurus from a text corpus. Each word is represented as a vector in a multi-dimensional space that captures cooccurrence information. Words are defined to be similar if they have similar cooccurrence patterns. Two different methods for using these thesaurus vectors in information retrieval are shown to significantly improve performance over the ARPA Tipster evaluation corpus as compared to a tf.idf baseline.
1 Introduction
Information retrieval systems typically define similarity between queries and documents in terms of a weighted sum of matching words (e.g. the vector model, as in Salton and McGill 1983). If a document is relevant but uses words synonymous to words in the query, it cannot be found. This is a particular problem if the query is short. One solution is to lengthen the query through relevance feedback (Salton and Buckley 1990). Another approach is to expand the query through synonym relations as found in a thesaurus. Here "synonym" is used loosely to mean "closely related word", as opposed to "syntactically and semantically interchangeable word". We define a thesaurus as simply a mapping from words to other closely related words.

For a thesaurus to be useful in information retrieval it must be specific enough to offer synonyms for words as used in the corpus of interest. For example, in a corpus of computer science documents the word "interpreter" would have meanings quite different from everyday language. It must also cover all or most of the words found in queries, including the potentially unbounded set of proper nouns. These two considerations suggest that generic thesauri (such as Roget's) that restrict themselves to common usage are unlikely to be helpful. Instead one must rely on thesauri tuned to the corpus of interest. These might be hand-built for a restricted domain or computed from the text of the corpus itself.

This paper presents a new corpus-based method for constructing a thesaurus based on lexical cooccurrence. The computation proceeds in two phases. First, the lexical cooccurrence pattern of each word is represented as a multi-dimensional vector, the thesaurus vector. Second, a similarity measure is induced on words by comparing these vectors. Given a particular word, its synonyms are then defined to be its nearest neighbors with respect to this similarity measure.
In the following we discuss previous approaches to thesaurus construction, describe the new cooccurrence-based thesaurus, and demonstrate two applications of the thesaurus to information retrieval that improve performance as compared to a standard vector-space similarity search baseline over the ARPA Tipster text retrieval evaluation corpus (Harman 1993b). The first application defines the context vector of a document to be the weighted sum of the thesaurus vectors of its contained words. These context vectors then induce a similarity measure on documents and queries which can be directly compared to standard vector-space methods. The second application analyzes a query into subtopics. Documents are then scored and ranked by the degree to which they simultaneously match the subtopics of a query.
2 Related Work
A thesaurus is a data structure that defines semantic relatedness between words. It is typically used in information retrieval to expand search terms with other closely related words. Even if a thesaurus is not explicitly computed, the mapping performed by query expansion implicitly defines a thesaurus. Therefore, we will first discuss previous approaches to thesaurus construction and then comment on query expansion work proper.
The simplest, and perhaps most conventional, approach to thesaurus construction is to manually build an explicit semantic mapping table. This is clearly labor-intensive, and hence only possible in specialized domains where repeated use may justify the cost. For example, the RUBRIC and TOPIC text retrieval systems (McCune et al. 1985) require a domain expert to prepare a hierarchical structure of "topics" (each topic is a boolean combination of other topics and search terms) germane to a particular subject area. Searchers then employ terms from this hierarchy to form queries that automatically expand to complex boolean expressions.
Another approach is to reuse existing online lexicographic databases, such as WordNet (Voorhees and Hou 1992) or Longman's subject codes (Liddy and Paik 1992). However, generic thesauri of this sort will often not be specific enough for the text collection at hand. For example, in (Voorhees and Hou 1992), "acts" is expanded with the meaning "acts of the apostles" in a corpus of legal documents. In addition, they frequently do not record information about proper nouns, yet proper nouns are often excellent retrieval cues.
Corpus-based methods perform a computation on the text of the documents in the corpus to induce a thesaurus. For example, Evans et al. (1991) construct a hierarchical thesaurus from a computed list of complex noun phrases where subsumption roughly corresponds to the subset relation defined on terms (e.g. "intelligence" subsumes "artificial intelligence"). While this method is superior to approaches that treat phrase terms as unanalyzed atoms, there is no notion of semantic similarity of basic terms. For example, the semantic similarity of "astronaut" and "cosmonaut" is not represented in the hierarchy.
Grefenstette (1992) and Ruge (1992) use head-modifier relationships to determine semantic closeness. This solution is costly since parsing technology is required to determine head-modifier relations in sentences. It is also unclear to what extent words with similar heads or modifiers are good candidates for expansion. For example, adjectives referring to countries have similar heads ("the Japanese/Chilean capital", "the Japanese/Chilean government"), but adding "Japanese" to a query that contains "Chilean" will rarely produce good results. Note that there are many words that distinguish "Japanese" and "Chilean" in terms of cooccurrence in a sentence: "Tokyo", "Andes", "Samurai", etc. Grefenstette (1992) demonstrates that head-modifier-based term expansion can improve retrieval performance. Our goal in this paper is to show that cooccurrence-based similarity, which is conceptually simpler than similarity with respect to heads or modifiers, is an equally powerful source of information for information retrieval.
Crouch (1990) approaches semantic relatedness by considering the occurrence of terms in documents. Documents are clustered into small groups based on a similarity measure that considers two documents similar if they share a significant number of terms, with medium frequency terms preferentially weighted. Terms are then grouped by their occurrence in these document clusters. Since a complete-link document clustering is performed, the procedure is very compute intensive; it would not scale to the Tipster reference collection. Further, the central assumption that terms are related if they often occur in the same documents seems problematic for corpora with long documents. It also does not capture the intuitive notion that synonyms do not cooccur, but rather have similar cooccurrence patterns. In contrast, the procedure proposed in this paper makes use of lexical cooccurrence, which is more informative both qualitatively and quantitatively (cf. Schütze 1992).
Two terms lexically cooccur if they appear in text within some distance of each other (typically a window of k words). Qualitatively, the fact that two words often occur close to each other is more likely to be significant than the fact that they occur in the same documents. Quantitatively, there are an order of magnitude more cooccurrence events than occurrence-in-document events in a given document collection. For a word occurring n times in the document collection and for a definition of cooccurrence as occurring in a window of k words, there are nk cooccurrence events, but only n occurrence-in-document events. If the goal is to capture information about specific words, we believe that lexical cooccurrence is the preferred basis for statistical thesaurus construction.
Crouch (1990) constructs thesaurus classes; words are binned into groups of related words. This is problematic since the boundaries between classes will inevitably be somewhat artificial.
If classes are made too small, some words will be cut off from part of their topical neighborhood. If they are too large, words will be forced into classes with words from different topics. Any particular class size will either separate some words from close neighbors or lump together some words with distant terms.

In contrast, we propose to construct a multi-dimensional continuous space in which each word's thesaurus vector represents its individual position. A continuous space does not force a classification choice, and hence avoids some of the ensuing problems.
Qiu and Frei (1993) present an elegant and effective thesaurus construction by inverting the standard vector-space document similarity function to define a similarity measure on terms. Terms are represented as high-dimensional vectors with a component for each document in the corpus. The value of each component is a function of the frequency the term has in that document. They show that query expansion using the cosine similarity measure on these vectors improves retrieval performance. However, because the term vectors are high-dimensional, the time complexity for computing the similarity between terms is related to the size of the corpus (in the same way that the cost of document similarity search is related to the size of the corpus). This prevents its use on a large scale, as will be proposed in the discussion on context vectors below. Further, as argued above, lexical cooccurrence offers a richer basis for determining word similarity than document occurrence.
Peat and Willett (1991) argue against the utility of cooccurrence-based expansion in principle. They observe that because synonyms often do not occur together, a cooccurrence-based approach may have difficulty identifying synonymy relations. However, although synonyms frequently don't cooccur, they tend to share neighbors that occur with both. For example, "litigation" and "lawsuit" share neighbors such as "court", "judge", and "proceedings". Our thesaurus represents lexical cooccurrence patterns and hence defines semantic closeness in terms of common neighbors. This implies we do not require synonyms to cooccur, but rather require them to have similar cooccurrence patterns.
A second problem pointed out by Peat and Willett is that many researchers use measures for defining closeness that will group words according to frequency: it is impossible for a frequent word to have an infrequent neighbor according to these measures. We avoid this difficulty by reducing the dimensionality of the thesaurus space using a singular value decomposition (cf. Deerwester et al. 1990). The reason for the closeness of terms with equal frequency is, roughly, that they have about the same number of zero entries in their term vectors. For a given term, SVD assigns values to all dimensions of the space, so that frequent and infrequent terms can be close in the reduced space if they occur with similar terms. For example, "accident" (frequency 2590 in our corpus) and "mishaps" (frequency 129) will come out as similar in the experiment described below despite the frequency difference between them.

3 Cooccurrence Thesaurus
The goal of the lexical-cooccurrence-based thesaurus is to associate with each term a vector that represents its pattern of local cooccurrences. This vector can then be compared with others to measure the cooccurrence similarity, and hence semantic similarity, of terms.

The starting point of the computation is to collect a (symmetric) term-by-term matrix C, such that element c_ij records the number of times that words i and j cooccur in a window of size k (k is forty words in our experiments). Topical or semantic similarity between two words can then be defined as the cosine between the corresponding columns of C. The assumption is that words with similar meanings will occur with similar neighbors if enough text material is available. Qiu and Frei (1993) use a similar scheme, although the matrix in their case is documents vs. terms.
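To make the basic construction concrete, here is a minimal sketch (not the authors' code; the toy corpus, vocabulary, and window size are illustrative assumptions) of collecting a symmetric term-by-term cooccurrence matrix and comparing two words by the cosine of their columns:

```python
# Minimal sketch: symmetric term-by-term cooccurrence counts within a window.
import numpy as np

def cooccurrence_matrix(tokens, vocab, k=40):
    """C[i, j] counts how often vocab words i and j cooccur within k tokens."""
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for pos, w in enumerate(tokens):
        if w not in index:
            continue
        for other in tokens[pos + 1:pos + k]:   # words following w inside the window
            if other in index:
                C[index[w], index[other]] += 1
                C[index[other], index[w]] += 1  # keep C symmetric
    return C

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

tokens = "the court ruled on the lawsuit after long litigation in court".split()
vocab = ["court", "lawsuit", "litigation", "ruled"]
C = cooccurrence_matrix(tokens, vocab, k=5)
# similarity of the columns for "lawsuit" and "litigation"
print(cosine(C[:, vocab.index("lawsuit")], C[:, vocab.index("litigation")]))
```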
However, simple resource calculations suggest that this direct approach is not workable. The matrix C has v^2/2 distinct entries, where v is the size of the vocabulary. Although this matrix is sparse, we can expect v to be very large, and hence the overall storage requirement to be unworkable. For example, the Tipster category B corpus (Harman 1993b), which is the subject of our experiments, has over 450,000 unique terms.

Even if enough memory were found to represent C directly, the thesaurus vectors associated with each word (columns of C) would be v-dimensional. Although these vectors are somewhat sparse, this implies that word comparisons are an order-v operation, which is prohibitively expensive for large-scale application.
We address these issues by reducing the dimensionality of the problem to a workable size. The key dimensionality reduction tool is a singular value decomposition (Golub and van Loan 1989) of a matrix of cooccurrence counts. However, this matrix must be constructed in a series of steps to keep the computations tractable at each stage.

3.1 Practical Implementation
The thesaurus is constructed in three steps as shown in Figure 1. The goal is to apply a singular value decomposition to reduce the dimensionality of the problem in a disciplined fashion and in the process produce more compact representations. However, since SVD is expensive (time proportional to n^2, where n is the dimensionality of the matrix), the dimensionality of the matrix fed into SVD cannot be too high. In particular, we cannot use the original matrix C. Instead we approximate it in a two-stage computation that derives two sets of topical word classes from the corpus (the A-classes and B-classes in the figure), which we use to contain the dimensionality of the problem without sacrificing too much information.
The topical word classes allow us to agglomerate information over similar words. We begin by constructing the full cooccurrence matrix for a subset of terms in the corpus (see "Matrix A" in Figure 1). In our experiment we chose 3,000 medium-frequency words (frequency ranks 2,000 through 5,000) for this subset. Element a_ij of the matrix records the number of times that words w_i and w_j cooccurred in a window of 40 words in the text collection. We then form the first set of topical word classes by clustering this A-subset into a groups based on the cosine similarity between the columns of matrix A. In our experiment we found 200 A-classes g_A1, g_A2, ..., g_A200 using group-average agglomerative clustering (Gnanadesikan 1977).
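A rough sketch of this clustering step, using SciPy's hierarchical clustering as a stand-in for the authors' implementation (the matrix below is random placeholder data, and the sizes are scaled down from the paper's 3,000 words and 200 classes):

```python
# Sketch: group-average agglomerative clustering of the columns of matrix A.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_columns(A, n_classes):
    cols = A.T                                             # one row per column (word) of A
    Z = linkage(cols, method="average", metric="cosine")   # group-average linkage
    return fcluster(Z, t=n_classes, criterion="maxclust")  # class label per word

A = np.random.poisson(1.0, size=(500, 500)).astype(float)  # stand-in for real counts
labels = cluster_columns(A, n_classes=50)
print(labels[:10])
```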
We now consider a larger vocabulary subset and collect a second matrix B which records for each word in this larger B-subset the number of times words in each A-class occur in neighborhoods around that word (see "Matrix B" in Figure 1). Each element b_ij records the number of times that word w_j cooccurs with any of the medium-frequency words from class g_Ai. This is similar to the usual cooccurrence matrix construction except that the matrix is no longer symmetric: rows correspond to A-classes, columns to words. In our experiment the B-subset contained the 20,000 most frequent words, excluding stop words. This B-subset is again partitioned into b word classes by clustering the columns of matrix B. The purpose of this second iteration is to ensure that each word in the corpus has sufficiently many neighbors from at least one word class. If we use only A-classes then many words would have no cooccurrence events. In contrast, every word cooccurs with several words in the B-subset and hence will have many cooccurrence events with respect to B-classes. However, we could not compute the B-classes directly because clustering a full 20,000 x 20,000 matrix over the B-subset is computationally intractable. In our experiment the 20,000 columns of matrix B were clustered into 200 B-classes g_B1, g_B2, ..., g_B200 using the Buckshot fast clustering algorithm (Cutting et al. 1992).[1]
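The following sketch approximates a Buckshot-style step along the lines of footnote 1 (reproduced later in this section): agglomeratively cluster a random sample of the columns, then assign every column to the nearest centroid. It is an illustration under those assumptions, not Cutting et al.'s code, and the data and parameters are placeholders.

```python
# Sketch: sample, cluster agglomeratively, then assign all items to centroids.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def buckshot(X, n_clusters, sample_size, seed=0):
    rng = np.random.default_rng(seed)
    sample = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    Z = linkage(X[sample], method="average", metric="cosine")
    seed_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    present = np.unique(seed_labels)                        # cluster ids actually formed
    centroids = np.vstack([X[sample][seed_labels == c].mean(axis=0) for c in present])
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    Cn = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
    return present[(Xn @ Cn.T).argmax(axis=1)]              # nearest centroid per word

B_columns = np.random.poisson(1.0, size=(2000, 200)).astype(float)  # words x A-classes
labels = buckshot(B_columns, n_clusters=200, sample_size=500)
print(labels[:10])
```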
Finally, a third cooccurrence matrix C' is collected for the full corpus vocabulary vs. the B-classes (see "Matrix C" in Figure 1). Element c'_ij contains the number of times that term j cooccurs in a window of k words with any word in class g_Bi. Matrix C' has b rows and v columns. In our experiment we used all 176,116 words that occurred at least twice in the collection and all 272,914 pairs of adjacent words that occurred at least 5 times, for a total of 449,030 unique terms. At this stage an SVD dimensionality reduction to p (p < b) is performed so that each of the v terms can be represented as a compact p-dimensional vector and also to improve generalization (Deerwester et al. 1990, Berry 1992). In our experiment, to reduce compute time, only a subset of the matrix, corresponding to the 1,000th through 6,000th most frequent words, was decomposed. This decomposition defined a mapping from the 200-dimensional B-class space to a 20-dimensional reduced space. By applying the mapping to each of the 449,030 200-component B-class vectors, a smaller 20-dimensional vector was computed for each word and pair.
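One plausible reading of this step, sketched below with NumPy, is an assumption rather than the authors' exact procedure (they used SVDPACK on a 200 x 5,000 submatrix of C'; the exact scaling of the projection is not specified here, and the matrices below are random stand-ins):

```python
# Sketch: SVD of a submatrix of C', then project all columns to p dimensions.
import numpy as np

b, p = 200, 20
C_sub = np.random.poisson(0.5, size=(b, 5000)).astype(float)   # columns for frequency ranks 1,000-6,000
U, S, Vt = np.linalg.svd(C_sub, full_matrices=False)
mapping = U[:, :p]                                             # b x p projection from the top singular vectors

C_cols = np.random.poisson(0.5, size=(b, 1000)).astype(float)  # any other vocabulary columns of C'
thesaurus_vectors = (mapping.T @ C_cols).T                     # one p-dimensional vector per term
print(thesaurus_vectors.shape)                                 # (1000, 20)
```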
One might ask why we do not employ clustering for the final reduction in dimensionality. The answer lies in the smoothing and improved generality resulting from an SVD reduction. Similarity between b-component vectors can be an errorful measure of semantic similarity since there may be several word classes with similar topics. In our experiment, for example, class g_B4 contains words like "navy", "radar", and "missile", while some of the members of class g_B47 are "tanks", "missiles", and "helicopters". If one of two words has many neighbors in g_B4 and the other has many in g_B47, then they would not be similar in the 200-dimensional space, but they are similar in the reduced space. This is because the SVD algorithm recognizes and eliminates such redundancies.
As described above, this computation requires four passes through the corpus. The first pass computes word and word pair frequencies. The second pass computes Matrix A and the A-classes, the third pass Matrix B and the B-classes, and the fourth pass Matrix C. In addition, Matrix C is SVD-decomposed and the thesaurus vectors computed. In our experiment, each pass through the Tipster Category B corpus took roughly six hours (including CPU and I/O time). Note that these computations could have been accelerated by using loosely coupled coarse-grained parallelism to effect a linear reduction in compute time. The SVD decomposition required roughly 30 minutes to compute (using SVDPACK, Berry 1992).

[Figure 1: Iterative thesaurus construction. (Panels: Matrix A, Matrix B, Matrix C.)]

[1] A random sample of 2,000 of the 20,000 words was clustered first, and this clustering was then extended to all 20,000 words by assigning every word to the cluster whose centroid it was closest to.

3.2 Sample Synonyms
The net effect of this computation is to produce for each unique term a dense p-dimensional vector that characterizes its cooccurrence neighborhoods. These vectors then define a thesaurus by associating each word with its nearest neighbors with respect to a similarity measure on vectors, in our experiments the cosine. Table 1 displays some of the associations found in our experiment over the Tipster category B corpus. Each row displays a word and its nine nearest neighbors. For example, "repair" is the nearest neighbor of "accident". Word pairs used as terms are displayed as couples separated by a semicolon. Words in upper case are hand-selected synonyms as might be found in a manually constructed thesaurus. They are particularly interesting because they are unlikely to cooccur with their mates and hence illustrate that this thesaurus construction effectively uses second-order cooccurrence (sharing neighbors in the corpus) rather than simple first-order cooccurrence (occurring next to each other) to find synonyms.
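An illustrative helper (not from the paper) showing how such a nearest-neighbor listing can be read off the thesaurus vectors by cosine similarity; the vocabulary and vectors below are random stand-ins:

```python
# Sketch: nearest neighbors of a word under cosine similarity of thesaurus vectors.
import numpy as np

def nearest_neighbors(word, vocab, vectors, n=9):
    V = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    i = vocab.index(word)
    sims = V @ V[i]                                  # cosine similarity to every word
    order = np.argsort(-sims)
    return [vocab[j] for j in order if j != i][:n]

vocab = ["accident", "repair", "mishaps", "tax", "levies", "treatment"]
vectors = np.random.randn(len(vocab), 20)            # stand-in 20-dimensional thesaurus vectors
print(nearest_neighbors("accident", vocab, vectors, n=3))
```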
4 Context Vectors
The thesaurus vectors computed above represent for each word its cooccurrence signature. To use this information directly in search, one needs a similar representation for documents. The simplest approach is to represent each document by the vector which is the sum of the thesaurus vectors for the words in its text. Formally,

    d_j = \sum_i w_{ij} v_i

where d_j is the vector for document j, w_ij is the weight for word i in document j, and v_i is the thesaurus vector for word i. Queries may be represented as vectors in the same way.

In our experiments, we use augmented tf.idf weighting (Salton and Buckley 1990) when summing thesaurus vectors:

    w_{ij} = \left( 0.5 + 0.5 \frac{tf_{ij}}{\max_i(tf_{ij})} \right) \log \frac{N}{n_i}

where tf_ij is the frequency of word i in document j, N is the total number of documents, and n_i is the document frequency of word i.
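A minimal sketch of this context-vector computation, assuming a dictionary of thesaurus vectors and precomputed document frequencies (all names and data below are illustrative):

```python
# Sketch: a document's context vector is the augmented-tf.idf-weighted sum of
# the thesaurus vectors of its words.
import math
from collections import Counter
import numpy as np

def context_vector(doc_tokens, thesaurus, doc_freq, n_docs):
    tf = Counter(t for t in doc_tokens if t in thesaurus)
    vec = np.zeros_like(next(iter(thesaurus.values())), dtype=float)
    if not tf:
        return vec
    max_tf = max(tf.values())
    for term, f in tf.items():
        w = (0.5 + 0.5 * f / max_tf) * math.log(n_docs / doc_freq[term])  # augmented tf.idf
        vec += w * thesaurus[term]
    return vec

thesaurus = {"court": np.random.randn(20), "lawsuit": np.random.randn(20)}
doc_freq = {"court": 5000, "lawsuit": 1200}                  # made-up document frequencies
doc = "the court heard the lawsuit and the court ruled".split()
print(context_vector(doc, thesaurus, doc_freq, n_docs=173000))
```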
Recall that document vectors that are computed according to this scheme are called "context vectors". Context vectors were first proposed by Gallant (1991) as an encoding based on hand-assigned microfeatures. In contrast, our approach is completely automatic since it depends only on the underlying thesaurus vectors. Hand-encoded features may also vary in how appropriate they are for different text collections. A derivation from the corpus of the application increases the chances that the representations are tuned to the relevant topics.

accident     repair faulty personnel accidents exhaust equipped MISHAPS injuries sites
advocates    passage PROPONENTS argument address favoring/-s compromise congress urge
litigation   LAWSUITS audit lawsuit file auditors auditor suit sued proceedings
tax          taxes income;tax/-es new;tax taxpayer/-s incentives LEVIES corporate;taxes
treatment    drugs syndrome administer/-ed study administering PROCEDURE undergo aids

Table 1: Five terms and their nearest neighbors in the thesaurus.

The work on Latent Semantic Indexing (Deerwester et al. 1990) also bears close resemblance to context vectors.
The LSI algorithm can be interpreted as computing document and query vectors as weighted sums of term vectors. With this interpretation, the difference between our approach and LSI is the different term weights (tf.idf weights in our case, weights representing the strengths of dimensions in the underlying latent structure in the case of LSI) and the different statistics from which term representations are derived (term cooccurrence in our case, document occurrence for LSI). Our method is also more efficient. The setup described in (Deerwester et al. 1990) requires a decomposition of a full term-by-document matrix, which for a large collection such as TIPSTER is an extremely time-consuming operation, since its complexity is quadratic in the number of documents. Although our method requires two additional passes through the corpus (for collecting the matrices A and B), it took only about 30 minutes to compute the singular value decomposition described above. This time is independent of the size of the corpus.

4.1 Application to TIPSTER
Context vectors were computed for the 25 Category B topics of the Tipster collection (Harman 1993b) and for each of the about 173,000 Wall Street Journal articles in the collection. For each query, documents were ranked according to vector similarity as computed by the cosine measure, and precision/recall statistics were collected. We compared our results against a baseline standard vector-space similarity search with augmented tf.idf term weighting (Salton and McGill 1983, Salton and Buckley 1990). Our direct application of context vectors achieved a small performance improvement in average precision. The baseline tf.idf average precision for this collection is 0.271 while context vectors achieved 0.274.
To achieve better results we considered schemes that combined the scores from the tf.idf baseline and context vectors. Formally, we considered document ranks of the form

    r' = \alpha \, r_{tf.idf} + (1 - \alpha) \, r_{cv}

where r_cv is the context vector rank, r_tf.idf is the tf.idf rank, and α is a free parameter between 0 and 1.

Figure 2 shows precision at eleven points of recall for tf.idf (bottom line) and for a linear combination with the optimal choice of α, found by exhaustive search (α = 0.7, middle line in Figure 2). The average precision for tf.idf is 0.271 and the average precision for the linear combination of tf.idf and context vectors is 0.300.
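A sketch of this rank combination (document ids, ranks, and the helper name are illustrative):

```python
# Sketch: merge the tf.idf ranking and the context-vector ranking with weight alpha.
def combine_ranks(rank_tfidf, rank_cv, alpha=0.7):
    """rank_tfidf, rank_cv: dicts mapping doc id -> rank (1 = best)."""
    combined = {d: alpha * rank_tfidf[d] + (1 - alpha) * rank_cv[d]
                for d in rank_tfidf}
    return sorted(combined, key=combined.get)   # lowest combined rank first

rank_tfidf = {"d1": 1, "d2": 2, "d3": 3}
rank_cv = {"d1": 3, "d2": 1, "d3": 2}
print(combine_ranks(rank_tfidf, rank_cv, alpha=0.7))   # ['d1', 'd2', 'd3']
```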
5 Word Factorization
Another application of thesaurus vectors is to analyze the query into topic-coherent word groups, which we call word factors. The goal is to ensure that documents are relevant to the entire query by arranging that their score with respect to each factor be high. In addition, word factors may be manually screened for relevance. Word factors containing nuisance or non-topical terms can be deleted from the query.

We used group-average agglomerative clustering to group query terms into factors based on their thesaurus vectors. In this experiment, we arbitrarily decided to cluster each topic into three word factors. For example, the following three factors were found for topic description 51, which is about trade conflicts between the United States and European countries on subsidies to the aircraft industry. The pairs of words that were used as terms are printed as two consecutive words separated by a semicolon. All directly juxtaposed words occurring at least five times in the corpus were used as terms.
• aid assistance british code complaint consortium controversy douglas economics european;governments financing french german government;assistance governments international;economics loan objection petition policy;review producer retaliation review sanctions spanish spanish;government tension

• aeronauticas aeronauticas;s.a aerospace aerospace;plc aerospatiale airbus airbus;industrie aircraft aircraft;consortium blohm boeing boelkow boelkow;blohm british;aerospace construcciones construcciones;aeronauticas douglas;corp european;aircraft gmbh mcdonnell mcdonnell;douglas messerschmitt messerschmitt;boelkow plc s.a

• airbus;subsidies antidumping countervailing countervailing;duty dumping dumping;duty federal;subsidies gatt general;agreement review;group subsidies tariffs trade;dispute trade;policy trade;tension

[Figure 2: Precision for 11 recall points for tf.idf (bottom line), context vectors (middle line), and word factorization (top line).]
The three factors correspond to the main topics of the query: international politics, the aircraft industry, and trade conflicts.

We can use these word factors to ameliorate one outstanding problem of similarity search: it treats search terms as if they were in one large disjunction. By scoring each factor separately and recombining them appropriately, we force documents to score highly on all factors, and thus introduce a conjunctive constraint.

For example, a document may score high for a query as a whole although it deals with only one of the subtopics of the query. A case in point is topic 51. Many high-scoring documents are about the aircraft industry without mentioning trade conflicts or international politics. Instead of evaluating the query as a whole, each subtopic should be evaluated individually and the results combined. If a document is irrelevant to one of the important subtopics of the query, then it often is irrelevant as a whole (for example, a document on the aircraft industry without any mention of trade). Therefore, we employed the following algorithm for ranking documents (a minimal sketch follows the list):

• rank documents for each of the subqueries q_1, q_2, q_3, resulting in ranks r_{1i}, r_{2i}, r_{3i} for document i

• assign the maximum of the ranks to document i: r'_i = \max_j(r_{ji})

• rank the r'_i to get the final ranking

This algorithm corresponds to imposing a boolean constraint on the subtopics of a query.
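Here is the promised sketch of the max-rank combination (document ids, toy ranks, and the helper name are illustrative, not the authors' code):

```python
# Sketch: a document's final score is its worst (maximum) rank over the
# word-factor subqueries; documents are ordered by that value.
def factor_rank(subquery_ranks):
    """subquery_ranks: list of dicts, each mapping doc id -> rank (1 = best)."""
    docs = subquery_ranks[0].keys()
    worst = {d: max(r[d] for r in subquery_ranks) for d in docs}
    return sorted(docs, key=worst.get)          # smallest worst-case rank first

q1 = {"d1": 1, "d2": 5, "d3": 2}
q2 = {"d1": 4, "d2": 1, "d3": 3}
q3 = {"d1": 2, "d2": 6, "d3": 3}
print(factor_rank([q1, q2, q3]))                # ['d3', 'd1', 'd2']
```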
The second goal of word factorization is to eliminate irrelevant words semi-automatically. Many words in the Tipster topic descriptions are not relevant for the query in question, but they should not be placed on a stop list either, because they could be relevant for other queries. For example, topic description 75 is about failed or successful automation. The topic is identified as belonging to the general area of "Science and Technology". Therefore, "science" is one of the terms of the query. However, it is not relevant for the query. One of the word factors of topic 75 is the following:

• failed instance force conversely science

This word factor doesn't contain good search terms and was therefore not used in retrieval. The decision whether a word factor was relevant or not was made manually. The word factors that were judged relevant were then combined according to the algorithm described above.
As in the previous experiment, a linear combination of tf.idf and context vectors to evaluate document rank with respect to each factor proved superior to using either method on its own (average precision 0.308 for a ranking based only on tf.idf, 0.287 for a ranking based only on context vectors). Figure 2 shows precision for 11 recall points for α = 0.86. Average precision is 0.3218, an improvement of about five percentage points over the tf.idf result of 0.271 reported above.

This result is about the same as the best reported in (Harman 1993a) for manual and automatic adhoc runs in Category B (0.3219, Queens College CUNY, p. 388). However, at this point our system lacks stemming, and the implementation of statistical phrases is restricted (only adjacent words were considered). We hope to improve our results further with stemming and more sophisticated statistical phrases.

6 Conclusion
We have proposed a new algorithm for inducing a thesaurus from a text corpus and have demonstrated two applications of it to information retrieval. First, we have shown that representing documents with context vectors derived from the thesaurus improves recall/precision performance in the Tipster collection. Second, we have introduced a novel application of thesauri to cluster query terms, and have shown that this procedure performs as well as the best results published in the TREC proceedings.

Our experiments suggest the quality of the context vector of a document is sensitive to its length: long documents produce poor context vectors since adding up thesaurus vectors from many topics leads to averaging out of fine distinctions that would be needed for good retrieval. Therefore, we are planning experiments with segmented documents. We hope that our method will improve precision even further when applied to document fragments of similar length.
Acknowledgments. We thank Marti Hearst, David Hull and John Tukey for their helpful comments and Mike Berry for making SVDPACK available to us.
References

Berry, M. W. 1992. Large-scale sparse singular value computations. The International Journal of Supercomputer Applications 6(1):13-49.

Crouch, C. J. 1990. An approach to the automatic construction of global thesauri. Information Processing & Management 26(5):629-640.

Cutting, D. R., J. O. Pedersen, D. Karger, and J. W. Tukey. 1992. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR '92, 318-329.

Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6):391-407.

Evans, D. A., K. Ginther-Webster, M. Hart, R. G. Lefferts, and I. A. Mo
