`Two Applications to Information Retrieval
`
`Hinrich Schutze and Jan O. Pedersen
`
`Xerox Palo Alto Research Center
`3333 Coyote Hill Road, Palo Alto, CA 94304, USA
`{schuetze,pedersen}@parc.xerox.com
`
`A bstract
`
`This paper presents a new m ethod for com puting
`a thesaurus from a text corpus. Each word is rep
`resented as a vector in a multi-dimensional space
`that captures cooccurrence information. Words are
`defined to be similar if they have similar cooccur
`rence patterns. Two different m ethods for using
`these thesaurus vectors in information retrieval are
`shown to significantly improve performance over the
`ARPA Tipster evaluation corpus as compared to a
`tf.idf baseline.
`
`1
`
`Introduction
`
`Information retrieval systems typically define sim
`ilarity between queries and documents in terms of
`a weighted sum of matching words (e.g. the vector
`model, ¿is in Salton and McGill 1983). If a doc
`ument is relevant but uses words synonym ous to
`words in the query, it cannot be found. This is a
`particular problem if the query is short. One solu
`tion is to lengthen the query through relevance feed
`back (Salton and Buckley 1990). Another approach
`is to expand the query through synonym relations
`as found in a thesaurus. Here “synonym ” is used
`loosely to mean “closely related word” , as opposed
`to “syntactically and semantically interchangeable
`word” . We define a thesaurus as sim ply a mapping
`from words to other closely related words.
`For a thesaurus to be useful in information re
`trieval it must be specific enough to offer synonym s
`for words as used in the corpus of interest. For ex
`ample, in a corpus of computer science documents
`the word “interpreter” would have meanings quite
`different from everyday language. It must also cover
`all or m ost of the words found in queries, includ
`
`ing the potentially unbounded set of proper nouns.
`These two considerations suggest that generic the
`sauri (such as Roget’s) that restrict themselves to
`com mon usage are unlikely to be helpful. Instead
`one must rely on thesauri tuned to the corpus of
`interest. These might be hand-built for a restricted
`domain or computed from the text of the corpus
`itself.
`This paper presents a new corpus-based method
`for constructing a thesaurus based on lexical cooc
`currence. The com putation proceeds in two phases.
`First, the lexical cooccurrence pattern of each word
`is represented as a multi-dimensional vector, the
`thesaurus vector. Second, a similarity measure is in
`duced on words by comparing these vectors. Given
`a particular word its synonyms are then defined to
`be its nearest neighbors with respect to this sim i
`larity me¿wu^e.
`In the following we discuss previous approaches
`to thesaurus construction, describe the new cooc
`currence- based thesaurus, and dem onstrate two ap
`plications of the thesaurus to information retrieval
`that improve performance as compared to a stan
`dard vector-space similarity search baseline over
`the ARPA Tipster text retrieval evaluation corpus
`(Harman 1993b). The first application defines the
`context vector of a document to be the weighted
`sum of the thesaurus vectors of its contained words.
`These context vectors then induce a sim ilarity mea
`sure on documents and queries which can be di
`rectly compared to standard vector-space m eth
`ods. The second application analyzes a query into
`subtopics. Documents are then scored and ranked
`by the degree to which they sim ultaneously match
`the subtopics of a query.
`
`IPR2017-01039
`Unified EX1023 Page 1
`
`
`
`2 R elated Work
`
`A thesaurus is a data structure that defines seman
`tic relatedness between words. It is typically used in
`information retrieval to expand search terms with
`other closely related words. Even if a thesaurus
`is not explicitly computed, the mapping performed
`by query expansion im plicitly defines a thesaurus.
`Therefore, we will first discuss previous approaches
`to thesaurus construction and then com ment on
`query expansion work proper.
`The sim plest, and perhaps m ost conventional,
`approach to thesaurus construction is to manually
`build an explicit semantic mapping table. This is
`clearly labor-intensive, and hence only possible in
`specialized domains where repeated use may justify
`the cost. For example, the RUBRIC and TOPIC
`text retrieval system s (McCune et al. 1985) require
`a dom ain expert to prepare a hierarchical structure
`of “topics” (each topic is a boolean com bination of
`other topics and search terms) germane to a par
`ticular subject area. Searchers then em ploy terms
`from this hierarchy to form queries that autom ati
`cally expand to com plex boolean expressions.
`Another approach is to reuse existing online lex
`icographic databases, such as WordNet (Voorhees
`and Hou 1992) or Longman’s subject codes (Liddy
`and Paik 1992). However, generic thesauri of this
`sort will often not be specific enough for the text
`collection at hand. For example, in (Voorhees and
`Hou 1992), “acts” is expanded with the meaning
`“acts of the apostles” in a corpus of legal docu
`ments. In addition, they frequently do not record
`information about proper nouns, yet proper nouns
`are often excellent retrieval cues.
`Corpus-based m ethods perform a com putation on
`the text of the documents in the corpus to induce a
`thesaurus. For example, Evans et al. (1991) con
`struct a hierarchical thesaurus from a com puted
`list of com plex noun phrases where subsum ption
`roughly corresponds to the subset relation defined
`on terms (e.g. “intelligence” subsumes “artificial
`intelligence”). W hile this m ethod is superior to
`approaches that treat phrase terms as unanalyzed
`atom s, there is no notion of semantic sim ilarity of
`basic terms. For example, the semantic sim ilarity
`of “astronaut” and “cosmonaut” is not represented
`in the hierarchy.
`Grefenstette (1992) and Ruge (1992) use head-
`modifier relationships to determine semantic close
`ness. This solution is costly since parsing tech
`nology is req u ired to determine head-modifier re
`lations in sentences.
`It is also unclear to what
`extent words with similar heads or modifiers are
`
`good candidates for expansion. For exam ple, adjec
`tives referring to countries have similar heads ( “the
`Japanese/Chilean capital” , “the Japanese/Chilean
`government”), but adding “Japanese” to a query
`that contains “Chilean” will rarely produce good
`results. Note that there are many words that distin
`guish “Japanese” and “Chilean” in terms of coocur-
`rence in a sentence: “Tokyo” , “Andes”, “Samu
`rai” , etc. Grefenstette (1992) dem onstrates that
`head-modifier-based term expansion can improve
`retrieval performance. Our goal in this paper is
`to show that cooccurrence-based similarity, which
`is conceptually simpler than sim ilarity with respect
`to heads or modifiers, is an equally powerful source
`of information for information retrieval.
`Crouch (1990) approaches semantic relatedness
`by considering the occurrence of terms in docu
`ments. Documents are clustered into sm all groups
`based on a similarity measure that considers two
`documents similar if they share a significant num
`ber of terms, with medium frequency terms pref
`erentially weighted. Terms are then grouped by
`their occurrence in these document clusters. Since a
`complete-link document clustering is performed the
`procedure is very com pute intensive; it would not
`scale to the Tipster reference collection. Further,
`the central assumption that terms are related if they
`often occur in the same docum ents seems problem
`atic for corpora with long documents. It also does
`not capture the intuitive notion that synonym s do
`not cooccur, but rather have similar cooccurrence
`patterns.
`In contrast the procedure proposed in
`this paper makes use of lexical cooccurrence, which
`is more informative both qualitatively and quanti
`tatively (cf. Schutze 1992).
`Two terms lexically cooccur if they appear in text
`within som e distance of each other (typically a win
`dow of k words). Qualitatively, the fact that two
`words often occur close to each other is more likely
`to be significant than the fact that they occur in
`the same documents. Quantitatively, there are on
`an order of magnitude more cooccurrence events
`than occurrence-in-document events in a given doc
`ument collection. For a word occurring n tim es in
`the document collection and for a definition of cooc
`currence as occurring in a window of k words, there
`are nk cooccurrence events, but only n occurrence-
`in-document events. If the goal is to capture in
`formation about specific words, we believe that lex
`ical cooccurrence is the preferred basis for statistical
`thesaurus construction.
`classes;
`thesaurus
`Crouch (1990)
`constructs
`words are binned into groups of related words. This
`is problematic since the boundaries between classes
`
`IPR2017-01039
`Unified EX1023 Page 2
`
`
`
`will be inevitably som ewhat artificial. If classes are
`made too sm all, som e words will be cut off from
`part of their topical neighborhood. If they are too
`large, words will be forced into classes with words
`from different topics. Any particular class size will
`either separate some words from close neighbors or
`lump together some words with distant terms.
`In contrast, we propose to construct a m ulti
`dimensional continuous space in which each word’s
`thesaurus vector represents its individual position.
`A continuous space does not force a classification
`choice, and hence avoids some of the ensuing prob
`lems.
`Qiu and Frei (1993) present an elegant and effec
`tive thesaurus construction by inverting the stan
`dard vector-space document sim ilarity function to
`define a similarity measure on terms. Terms are
`represented as high-dimensional vectors with a com
`ponent for each document in the corpus. The
`value of each component is a function of the fre
`quency the term has in that document. They show
`that query expansion using the cosine similarity
`measure on these vectors improves retrieval perfor
`mance. However, because the term vectors are high
`dimensional, the tim e com plexity for com puting the
`similarity between terms is related to the size of the
`corpus (in the same way that the cost of document
`similarity search is related to the size of the cor
`pus). This prevents its use on a large scale, as will
`be proposed in the discussion on context vectors be
`low. Further, as argued above lexical cooccurrence
`offers a richer basis for determining word similarity
`than document occurrence.
`Peat and W illett (1991) argue against the utility
`of cooccurrence-based expansion in principle. They
`observe that because synonym s often do not occur
`together a cooccurrence-based approach may have
`difficulty identifying synonym y relations. However,
`although synonym s frequently don’t cooccur, they
`tend to share neighbors that occur with both. For
`example, “litigation” and “lawsuit” share neighbors
`such as “court”, “judge”, and “proceedings”. Our
`thesaurus represents lexical cooccurrence patterns
`and hence defines semantic closeness in terms of
`common neighbors. This implies we do not require
`synonym s to cooccur, but rather require them to
`have similar cooccurrence patterns.
`A second problem pointed out by Peat and W il
`lett is that many researchers use measures for defin
`ing closeness that will group words according to fre
`quency: it is im possible for a frequent word to have
`an infrequent neighbor according to these measures.
`We avoid this difficulty by reducing the dimension
`ality of the thesaurus space using a singular value
`
`decomposition (cf. Deerwester et al. 1990). The rea
`son for the closeness of terms with equal frequency
`is, roughly, that they have about the same number
`of zero entries in their term vectors. For a given
`term, SVD assigns values to all dimensions of the
`space, so that frequent and infrequent terms can be
`close in the reduced space if they occur with simi
`lar terms. For exam ple, “accident” (frequency 2590
`in our corpus) and “m ishaps” (frequency 129) will
`come out as similar in the experiment described be
`low despite the frequency difference between them.
`
`3 C ooccurrence Thesaurus
`
`lexical-cooccurrence-based the
`The goal of the
`saurus is to associate with each term a vector that
`represents its pattern of local cooccurrences. This
`vector can then be compared with others to mea
`sure the cooccurrence similarity, and hence seman
`tic similarity of terms.
`The starting point of the com putation is to collect
`a (sym metric) term-by-term matrix C, such that el
`ement Cij records the number of times that words i
`and j cooccur in a window of size k (k is forty words
`in our experiments). Topical or semantic similarity
`between two words can then be defined as the co
`sine between the corresponding columns of C. The
`assumption is that words with similar meanings will
`occur with similar neighbors if enough text m ate
`rial is available. Qiu and Frei (1993) use a similar
`scheme, although the m atrix in their case is docu
`ments vs. terms.
`However, sim ple resource calculations suggest
`that this direct approach is not workable. The m a
`trix C has v 2/2 distinct entries where v is the size
`of the vocabulary. Although this matrix is sparse,
`we can expect v to be very large, and hence the
`overall storage requirement to be unworkable. For
`example, the Tipster category B corpus (Harman
`1993b), which is the subject of our experiments, has
`over 450,000 unique terms.
`Even if enough memory were found to represent
`C directly, the thesaurus vectors associated with
`each word (columns of C) would be v-dimensional.
`Although these vectors are somewhat sparse, this
`implies that word comparisons are an order v op
`eration, which is prohibitively expensive for large
`scale application.
`We address these issues by reducing the dimen
`sionality of the problem to a workable size. The key
`dimensionality reduction tool is a singular value de
`composition (Golub and van Loan 1989) of a matrix
`of cooccurrence counts. However, this matrix must
`
`IPR2017-01039
`Unified EX1023 Page 3
`
`
`
`be constructed in a series of steps to keep the com
`putations tractable at each stage.
`
`3.1 Practical Im plem entation
`
`The thesaurus is constructed in three steps as shown
`in Figure 1. The goal is to apply a singular value
`decom position to reduce the dim ensionality of the
`problem in a disciplined fashion and in the process
`produce more com pact representations. However,
`since SVD is expensive, (tim e proportional to n 2,
`where n is the dim ensionality of the m atrix), the
`dim ensionality of the m atrix fed into SVD cannot
`be too high. In particular, we cannot use the orig
`inal m atrix C. Instead we approximate it in a two
`stage com putation that derives two sets of topical
`word classes from the corpus (the A-classes and B-
`classes in the figure) which we use to contain the
`dim ensionality of the problem without sacrificing
`too much information.
`The topical word classes allow us to agglomerate
`information over similar words. We begin by con
`structing the full cooccurrence m atrix for a subset
`of terms in the corpus (see “Matrix A ” in Figure 1).
`In our experiment we chose 3,000 medium frequency
`words (frequency ranks 2,000 through 5,000) for
`this subset. Element Oij of the m atrix records the
`number of tim es that words Wi and Wj cooccurred
`in a window of 40 words in the text collection.
`We then form the first set of topical word classes
`by clustering this A-subset into a groups based on
`the cosine sim ilarity between the columns of ma
`trix A. In our experiment we found 200 A-classes
`9a u 9 a 2, • • •, 9 A200 using group average agglomera-
`tive clustering (Gnanadesikan 1977).
`We now consider a larger vocabulary subset
`and collect a second m atrix B which records for
`each word in this larger B-subset the number of
`tim es words in each A-class occur in neighborhoods
`around that word (see “Matrix B” in Figure 1).
`Each element bij records the number of times that
`word Wj cooccurs with any of the medium-frequency
`words from class g^i- This is similar to the usual
`cooccurrence m atrix construction except that the
`m atrix is no longer symmetric: rows correspond to
`A-classes, columns to words. In our experiment the
`B-subset contained the 20,000 most frequent words,
`excluding stop words. This B-subset is again parti
`tioned into b word classes by clustering the columns
`of m atrix B. The purpose of this second iteration
`is to ensure that each word in the corpus has suffi
`ciently many neighbors from at least one word class.
`If we use only A-classes then many words would
`have no cooccurrence events.
`In contrast every
`
`word cooccurs with several words in the B-subset
`and hence will have many cooccurrence events with
`respect to B-classes. However, we could not com
`pute the b-classes directly because a 6 x 6 m atrix
`is com putationally intractable. In our experiment
`the 20,000 columns of m atrix B were clustered into
`200 B-classes gBu9B2, • ■ • , 9 B200 using the Buck
`shot fast clustering algorithm (Cutting et al. 1992).1
`Finally, a third cooccurrence m atrix C* is col
`lected for the full corpus vocabulary vs. the B-
`classes (see “Matrix C” in Figure 1). Element Cij
`contains the number of tim es that term j cooccurs
`in a window of k words with any word in class gsi>
`Matrix C ' has b rows and v columns. In our ex
`periment we used all 176,116 words that occurred
`at least twice in the collection and all 272,914 pairs
`of adjacent words that occurred at least 5 times,
`for a total of 449,030 unique terms. At this stage
`an SVD dim ensionality reduction to p (p < b) is
`performed so that each of the v terms can be repre
`sented as a com pact p dimensional vector and also
`to improve generalization (Deerwester et al. 1990,
`Berry 1992). In our experiment to reduce compute
`tim e only a subset of the m atrix, corresponding to
`the 1 0 0 0 th through 6000th m ost frequent word, was
`decomposed. This decom position defined a m ap
`ping from the 200-dimensional B-class space to a 20-
`dimensional reduced space. By applying the m ap
`ping to each of the 449,030 200-component B-class
`vectors, a smaller 2 0-dimensional vector was com
`puted for each word and pair.
`One m ight ask why we do not employ clustering
`for the final reduction in dimensionality. The an
`swer lies in the sm oothing and improved generality
`resulting from an SVD reduction. Similarity be
`tween b-component vectors can be an errorful mea
`sure of semantic sim ilarity since there may be sev
`eral word classes with similar topics. In our exper
`iment, for exam ple, class g s 4 contains words like
`“navy” , “radar”, and “m issile” , while some of the
`members of class g s 47 are “tanks”, “m issiles”, and
`“helicopters” . If one of two words has many neigh
`bors in gsA and the other has many in gB47 then
`they would not be similar in the 2 0 0 -dimensional
`space, but they are similar in the reduced space.
`This is because the SVD algorithm recognizes and
`elim inates such redundancies.
`As described above this com putation requires
`four passes through the corpus. The first pass com
`putes word and word pair frequencies. The sec-
`
`1A ran d o m sam ple of 2,000 of th e 20,000 words was clus
`te red first, an d th is clustering was th e n exten d ed to all 20,000
`words by assigning every word to th e cluster whose centroid
`it was closest to.
`
`IPR2017-01039
`Unified EX1023 Page 4
`
`
`
`M atrix B
`
`M atrix C
`
`Figure 1: Iterative thesaurus construction.
`
`ond pass com putes Matrix A and the A-classes, the
`third pass Matrix B and the B-classes, the fourth
`pass Matrix C. In addition, Matrix C is SVD de
`composed and thesaurus vectors computed. In our
`experiment, each pass through the Tipster Category
`B corpus took roughly six hours (includes CPU and
`10 tim e). Note that these com putations could have
`been accelerated by using loosely coupled coarse
`grained parallelism to effect a linear reduction in
`com pute time. The SVD decomposition required
`roughly 30 m inutes to com pute (using SVDPACK,
`Berry 1992).
`
`3.2 Sam ple Synonym s
`
`The net effect of this com putation is to produce
`for each unique term a dense p-dimensional vector
`that characterizes its cooccurrence neighborhoods.
`These vectors then define a thesaurus by associating
`each word with its nearest neighbors with respect to
`a sim ilarity measure on vectors, in our experiments
`the cosine. The following table displays some of the
`associations found in our experiment over the Tip
`ster category B corpus. Each row displays a word
`and its nine nearest neighbors. For example, “re
`pair” is the nearest neighbor of “accident”. Word
`pairs used as terms are displayed as couples sepa
`rated by semicolon. Words in upper case are hand-
`selected synonym s as might be found in a manually
`constructed thesaurus. They are particularly in
`teresting because they are unlikely to cooccur with
`their m ates and hence illustrate that this thesaurus
`construction effectively uses second-order cooccur
`rence (sharing neighbors in the corpus) rather than
`sim ple first-order coocurrence (occurring next to
`each other) to find synonyms.
`
`4 C ontext V ectors
`
`The thesaurus vectors computed above represent
`for each word its cooccurrence signature. To use
`this information directly in search, one needs a sim
`ilar representation for documents. The simplest ap
`proach is to represent each document by the vector
`which is the sum of the thesaurus vectors for the
`words in its text. Formally,
`
`<5 =
`
`where dj is the vector for document j, Wij is the
`weight for word i in document
`and Vi is the the
`saurus vector for word i. Queries may be repre
`sented as vectors in the same way.
`In our experiments, we use augmented tf.idf
`weighting when sum m ing thesaurus vectors:
`(Salton and Buckley 1990)
`
`Vij = (0.5 + 0.5 *
`
`/ N \
`\ »
`tfjj
`maxi(tfij) * °9 ri*
`
`where tfij is the frequency of word i in document
`j, AT is the total number of documents, rii is the
`document frequency of word i.
`Recall that document vectors that are com
`puted according to this scheme are called “con
`text vectors”. Context vectors were first proposed
`by Gallant (1991) as an encoding based on hand-
`assigned microfeatures. In contrast, our approach is
`com pletely autom atic since it depends only on the
`underlying thesaurus vectors. Hand-encoded fea
`tures may also vary in how appropriate they are
`for different text collections. A derivation from the
`corpus of the application increases the chances that
`the representations are tuned to the relevant topics.
`The work on Latent Semantic Indexing (Deer-
`wester et al. 1990) also bears close resemblance to
`
`IPR2017-01039
`Unified EX1023 Page 5
`
`
`
`accident
`advocates
`litigation
`tax
`treatment
`
`repair faulty personnel accidents exhaust equipped MISHAPS injuries sites
`passage PROPONENTS argument address favoring/-s compromise congress urge
`LAWSUITS audit lawsuit file auditors auditor suit sued proceedings
`taxes income;tax/-es newjtax taxpayer/-s incentives LEVIES corporate;taxes
`drugs syndrome administer/-ed study administering PROCEDURE undergo aids
`
`Table 1: Five terms and their nearest neighbors in the thesaurus.
`
`context vectors. The LSI algorithm can be inter
`preted as com puting document and query vectors
`as weighted sums of term vectors. W ith this in
`terpretation, the difference between our approach
`and LSI is the different term weights (tf.idf weights
`in our case, weights representing the strengths of
`dim ensions in the underlying latent structure in
`the case of LSI) and the different statistics from
`which term representations are derived (term cooc
`currence in our case, document occurrence for LSI).
`Our method is also more efficient. The setup de
`scribed in (Deerwester et al. 1990) requires a decom
`position of a full term-by-document m atrix, which
`for a large collection such as TIPSTER is an ex
`tremely tim e-consum ing operation, since its com
`plexity is quadratic in the number of documents.
`Although our m ethod requires two additional passes
`through the corpus (for collecting the matrices A
`and B), it took only about 30 minutes to com pute
`the singular value decomposition described above.
`This tim e is independent of the size of the corpus.
`
`4.1 Application to TIPSTER
`
`Context vectors were computed for the 25 Category
`B topics of the Tipster collection (Harman 1993b)
`and for each of the about 173,000 Wall Street Jour
`nal articles in the collection. For each query, doc
`uments were ranked according to vector similar
`ity as computed by the cosine measure and pre-
`cision/recall statistics collected. We compared our
`results against a baseline standard vector space sim
`ilarity search with augmented tf.idf term weighting
`(Salton and McGill 1983,Salton and Buckley 1990).
`Our direct application of context vectors achieved
`a sm all performance improvement in average preci
`sion. The baseline tf.idf average precision for this
`collection is 0.271 while context vectors achieved
`0.274.
`To achieve better results we considered schemes
`that combined the scores from the tf.idf baseline
`and context vectors. Formally, we considered doc
`ument ranks of the form
`
`r' = a * Ttf.idf + (1 - a ) * rcv
`
`where rcv is the context vector rank and rtf.idf is
`
`the tf.idf rank and a is a free parameter between 0
`and 1 .
`Figure 2 shows precision at eleven points of recall
`for tf.idf (bottom line) and for a linear combination
`for the optim al choice of a , found by exhaustive
`search (a = 0.7, middle line in Figure 2). The aver
`age precision for tf.idf is 0.271 and the average preci
`sion for the linear combination of tf.idf and context
`vectors is 0.300.
`
`5 W ord Factorization
`
`Another application of thesaurus vectors is to an
`alyze the query into topic-coherent word groups,
`which we call word factors. The goal is to ensure
`that documents are relevant to the entire query by
`arranging that their score with respect to each fac
`tor be high. In addition, word factors may be m anu
`ally screened for relevance. Word factors containing
`nuisance or non-topical terms can be deleted from
`the query.
`We used group average agglom erative clustering
`to group query terms into factors based on their
`thesaurus vectors. In this experiment, we arbitrar
`ily decided to cluster each topic into three word fac
`tors. For example, the following three factors were
`found for topic description 51, which is about trade
`conflicts between the United States and European
`countries on subsidies to the aircraft industry. The
`pairs of words that were used as terms are printed
`as two consecutive words separated by a semicolon.
`All directly juxtaposed words occurring at least five
`tim es in the corpus were used as terms.
`
`• aid assistance british code com plaint
`consortium controversy douglas economics
`europeanjgovernments financing french
`german government;assistance governments
`international;economics loan objection petition
`policy;review producer retaliation review sanc
`tions Spanish Spanish ¡government tension
`
`• aeronáuticas aeronauticas;s.a aerospace
`aerospace;pic aerospatiale airbus airbus;indus-
`trie aircraft aircraft ¡consortium blohm boeing
`boelkow boelkow;blohm british ¡aerospace con-
`
`IPR2017-01039
`Unified EX1023 Page 6
`
`
`
`0.8
`
`OO
`
`4
`
`1
`
`Figure 2: Precision for 11 recall points for tf.idf (bottom line), context vectors (middle line), and word
`factorization (top line).
`
`strucciones construcciones;aeronauticas
`douglas;corp european;aircraft gmbh mcdon-
`nell mcdonnell;douglas messerschmitt messer-
`schmitt;boelkow pic s.a
`
`• airbus;subsidies antidum ping countervailing
`countervailingjduty dumping dumping;duty
`federal;subsidies gatt general;agreement
`review;group subsidies tariffs tradejdispute
`trade;policy tradejtension
`
`The three factors correspond to the main topics of
`the query: international politics, the aircraft indus
`try and trade conflicts.
`We can use these word factors to ameliorate one
`outstanding problem of sim ilarity search: it treats
`search terms as if they were in one large disjunc
`tion. By scoring each factor separately and recom
`bining them appropriately, we force documents to
`score highly on all factors, and thus introduce a
`conjunctive constraint.
`For example, a document may score high for a
`query as a whole although it deals with only one
`of the subtopics of the query. A case in point is
`topic 51. Many high-scoring documents are about
`the aircraft industry without mentioning trade con
`flicts or international politics. Instead of evaluat
`ing the query as a whole, each subtopic should be
`evaluated individually and the results combined. If
`a document is irrelevant to one of the important
`subtopics of the query, then it often is irrelevant as
`a whole (for example a document on the aircraft
`
`industry without any mention of trade). Therefore,
`we employed the following algorithm for ranking
`documents:
`
`• rank documents for each of the subqueries
`9 i >9 2»9 3> resulting in ranks r ii,r 2«,r3j for doc
`ument i
`
`• assign the m aximum of the ranks to document
`i:
`
`/(¿) = maxj(rji)
`
`• rank the r' to get the final ranking
`
`This algorithm corresponds to im posing a boolean
`constraint on the subtopics of a query.
`The second goal of word factorization is to elim
`inate irrelevant words semi-automatically. Many
`words in the Tipster topic descriptions are not rele
`vant for the query in question, but they should not
`be placed on a stop list either because they could
`be relevant for other queries. For example, topic
`description 75 is about failed or successful autom a
`tion. The topic is identified as belonging to the gen
`eral area of “Science and Technology” . Therefore,
`“science” is one of the terms of the query. However,
`it is not relevant for the query. One of the word
`factors of topic 75 is the following:
`
`• failed instance force conversely science
`
`This word factor doesn’t contain good search terms
`and was therefore not used in retrieval. The deci
`sion whether a word factor was relevant or not was
`
`IPR2017-01039
`Unified EX1023 Page 7
`
`
`
`made manually. The word factors that were judged
`relevant were then combined according to the algo
`rithm described above.
`As in the previous experiment, a linear combina
`tion of tf.idf and context vectors to evaluate doc
`ument rank with respect to each factor proved su
`perior to using either m ethod on its own (average
`precision 0.308 for a ranking based only on tf.idf,
`0.287 for a ranking based only on context vectors).
`Figure 2 shows precision for 11 recall points for
`a — 0.86. Average precision is 0.3218. This is a
`five percent improvement over the tf.idf result of
`0.271 reported above.
`This result is about the same as the best reported
`in (Harman 1993a) for manual and autom atic adhoc
`in Category B (0.3219, Queens College CUNY, p.
`388). However, at this point our system lacks stem
`ming, and the im plem entation of statistical phrases
`is restricted (only adjacent words were considered).
`We hope to improve our results further with stem
`m ing and more sophisticated statistical phrases.
`
`6 C onclusion
`
`We have proposed a new algorithm for inducing
`a thesaurus from a text corpus and have demon
`strated two applications of it to information re
`trieval. First, we have shown that representing doc
`uments with context vectors derived from the the
`saurus improves recall/precision performance in the
`Tipster collection. Second, we have introduced a
`novel application of thesauri to cluster query terms,
`and have shown that this procedure performs as well
`as the best results published in the TREC proceed
`ings.
`Our experim ents suggest the quality of the con
`text vector of a docum ent is sensitive to its length:
`Long documents produce poor context vectors since
`adding up thesaurus vectors from many topics leads
`to averaging out of fine distinctions that would be
`needed for good retrieval. Therefore, we are plan
`ning experim ents with segmented documents. We
`hope that our m ethod will improve precision even
`further when applied to document fragments of sim
`ilar length.
`
`thank Marti Hearst,
`A c k n o w le d g m e n ts . We
`David Hull and John Tukey for their helpful com
`ments and Mike Berry for making SVDPACK avail
`able to us.
`
`R eferences
`
`Berry, M. W . 1992. Large-scale sparse singular
`value com putations. The International Journal
`of Supercomputer Applications 6(1):13—49.
`
`Crouch, C. J. 1990. An approach to the autom atic
`construction of global thesauri. Information Pro
`cessing & Management 26(5):629-640.
`
`Cu