`Information Retrieval
`
`K.L. Kwok
`Computer Science Department, Queens College, City University of New York, Flushing, NY 11367.
`E-mail: kwok@ir.cs.qc.edu URL: http://ir.cs.qc.edu
`
`For information retrieval in the Chinese language, three
`representation methods for texts are popular, namely:
`1-gram or character, bigram, and short-word. Each has
`its advantages as well as drawbacks. Employing more
`than one method may combine advantages from them
`and enhance retrieval effectiveness. We investigated
`two ways of using them simultaneously: mixing repre-
`sentations in documents and queries, and combining
`retrieval lists obtained via different representations. The
`experiments were done with the 170 MB evaluated Chi-
`nese corpora and 54 long and short queries available
from the TREC program, using our Probabilistic
`Indexing and Retrieval Components System (PIRCS re-
`trieval system). Experiments show that good retrieval
`need not depend on accurate word segmentation; ap-
`proximate segmentation into short-words will do. Re-
`sults also show and confirm that bigram representation
`alone works well; mixing characters with bigram repre-
`sentation boosts effectiveness further, but it is prefera-
`ble to mix characters with short-word indexing which is
more efficient, requires fewer resources, and gives better
`retrieval more often. Combining retrieval
`lists from
`short-word with character representation and from big-
`ram indexing provides the best retrieval results but also
`at a substantial cost.
`
`1. Introduction
`
`It is well known that a sentence in Chinese (and some
`other agglutinative Asian languages) consists of a continu-
`ous string of ‘characters’ without delimiting white spaces to
`identify words. In Chinese, the characters are called ideo-
`graphs. This makes it difficult to do computational studies
`on these languages since isolated words are needed for
many purposes such as linguistic analysis, machine translation, etc.
`
`Some results in this paper have been published in previous conference
proceedings; see Kwok (1997a, 1997b).
`
`Received April 24, 1998; revised November 25, 1998; accepted February
`9, 1999.
`
`© 1999 John Wiley & Sons, Inc.
`
Automatic methods for isolating words in a sentence—a process called word segmentation—are therefore an important and necessary first step to be taken before other analysis can begin. Chinese words can be any number
`other analysis can begin. Chinese words can be any number
`of characters in length, although the majority of the com-
`mon words are four characters or less (except for some
`proper nouns). Longer words are most probably phrases, but
`the boundaries between words and phrases are not clear.
`Many researchers have proposed practical methods to do
word segmentation (Chen & Kiu, 1992; Wu & Tsang,
`1995; Nie, Brisebois, & Ren, 1996; Jin & Chen, 1996; Ponte
`& Croft, 1996; Sproat, Shih, Gale, & Chang, 1996; Sun,
`Shen, & Huang, 1997).
`Information retrieval (IR) deals with the task of selecting
`relevant documents for a user need that is expressed in free
`text. The document collection is usually huge, gigabytes in
`size, and both queries and documents are domain unre-
stricted and unpredictable. When one does IR in the Chinese language, with this peculiar property, one would expect that accurate word segmentation is also a crucial first step before other processing can begin.
`However, in the recent TREC-5 & 6 workshops (Text
`REtrieval Conferences sponsored by the National Institute
of Standards & Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA)) where
`fairly large scale, blind Chinese retrieval experiments were
`performed (Kwok & Grunfeld, 1997; Kwok, Grunfeld, &
`Xu, 1998), we have demonstrated that an approximate
`short-word segmentation method, coupled with a powerful
`retrieval algorithm, is sufficient to provide very good re-
`sults. Moreover, experiments by us as well as by others
using an even simpler bigram representation of text (i.e., all consecutive overlapping character pairs), both within and
`outside the TREC environment, also produce comparable
`results (Chien, 1995; Liang, Lee, & Yang, 1996; Chen, He,
`Xu, Gey, & Meggs, 1997; Kwok, 1997a). This is a bit
`counter-intuitive because the bigram method leads to over
`three times as large an indexing feature space compared
`with our segmentation (approximately 1.5 million vs. 0.4
`
`million), and one would expect that there are many random,
`non-content matchings between queries and documents that
`may adversely affect outcome. Apparently, this is not so.
`Even 1-gram (single character) representation can return
weaker but still quite respectable retrieval results (Kwok,
`1997a). Since IR is a difficult task, it may be helpful to
combine these representations to enhance retrieval, as each representation type can have merits and strengths
`of its own.
`Employing multiple representations to improve retrieval
is often done in English IR. Katzer et al. (1982) have shown
`that different document representations can lead to different
`sets of documents retrieved even though the average preci-
`sion and recall measures can be superficially similar. Mix-
ing simple phrases (such as 2-word adjacency phrases) with
`single terms for improved item representation is quite stan-
`dard in many systems (e.g., see Buckley, Salton, & Allan,
`1993) including ours that participated in TREC experi-
`ments. Investigators have also studied different representa-
`tions of the same query need such as multiple Boolean
`formulations (e.g., see Belkin et al. 1994), or Boolean and
`natural language formulation (e.g., see Kwok, Grunfeld, &
`Lewis, 1994), and either combine the queries or combine
`the retrieval lists in an effort to enhance retrieval. The
`INQUERY system (Croft, Callan, & Broglio, 1994) in par-
`ticular has explicitly built-in facilities to combine beliefs
`from different representations of the same query need that
`have been propagated from documents which can also have
`different representations. Examples of employing both long
`and short word representations for IR in Asian languages
`include the approaches by Ogawa and Matsuda (1997),
`Chien (1997), and Fujii and Croft (1993).
This study presents investigations into using combinations of representations for Chinese IR and reports some results.
`Section 2 discusses the different representation types and
`their value for concept indexing; in particular, Section 2.3
`discusses our method of short-word segmentation. Section 3
`presents a review of our PIRCS retrieval system, the col-
`lection statistics, and evaluation measures used. Section 4
`discusses results of our experiments using representations
`singly, mixed, as well as for retrieval list combination.
`Section 5 provides a cost and benefit discussion of combin-
`ing representations. Section 6 contains observations and
`conclusion.
`
`2. Simple Representation Types for Chinese
`Retrieval
`
`2.1. 1-gram (Single Character) Indexing
`
1-gram indexing means that, for each sentence, punctuation marks and symbols are removed and the remaining string is separated into single characters. Every character is kept
`for representation purposes. Neither a dictionary of legal
`terms nor linguistic analysis is required and it is the simplest
indexing method. Consider the title section of the TREC-5 Chinese topic #CH3, which means: Chinese ( ), nuclear power plant ( ), operating ( ), condition ( ) when it is segmented by words. However, there is no one "correct" segmentation: it is also possible to regard ( ) or ( ) as words, or even single characters as words, and argue that ( ) is a phrase like the English counterpart. When one uses character representation, the title becomes the corresponding sequence of single characters.
Single characters can be highly ambiguous in meaning depending on the context, and matching queries with documents on characters only could adversely affect precision. For example, ( ) can mean "center," and only because it is in juxtaposition with ( ) that the two become one word, which is an abbreviation of the Peoples' Republic of China. ( ) can mean the verb "stand," or a 'stopping place' like ( ) (bus stop), or in this case "plant" or "station" because it is next to ( ) (electricity). ( ) can be a word by itself meaning "pit" (of a fruit) as in ( ) (seedless grapes). Here it is next to ( ) and one deduces that it should be an abbreviation of the two-character word ( ) meaning "nuclear." Thus, as far as attempting to recover the meaning of the string is concerned, 1-gram segmentation is a poor representation.
On the other hand, as index terms for IR, 1-grams can have their value. This is because many short-words often belong to the same topical area as one of the characters forming the word. For example, words like ( ) (battery), ( ) (lightning), ( ) (electric lamp), and ( ) (motor-driven) all share the character ( ) meaning (electricity, electric) and may characterize meaning on a higher hierarchy from which these words are derived. Characters probably have similar (perhaps less) precision as English stems do for indexing. When they are used, it guarantees that if there are correct word matches between queries and documents there will be 1-gram matches, which should be good for recall.
The set of commonly used Chinese characters is small. In the GuoBiao-2312 encoding scheme, only 6,763 are used.
`For the indexing of general newspaper articles, this should
`be compared to something like a couple of hundred thou-
`sand English indexing terms for collections of a few hun-
`dred megabytes. We would therefore expect each charac-
`ter’s use to spread over many more documents and their
`Zipfian curve to be comparatively flatter than English in-
`dexing terms. This may translate to higher imprecision.
`Nevertheless, retrieval by 1-gram may provide a basis from
`which one can measure improvements by other representa-
`tion methods.
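As an illustration of the mechanics (a minimal sketch of our own, not code from the paper), 1-gram indexing amounts to stripping punctuation and symbols and keeping every remaining character as an indexing feature; the abbreviated punctuation set below is an assumption of the sketch.

    from collections import Counter

    # Assumed, abbreviated punctuation/symbol set; a real system would cover
    # the full set of full-width punctuation in the encoding used.
    PUNCT = set("，。、；：！？「」『』（）《》,.;:!?()'\" \n\t")

    def unigram_index(sentence: str) -> Counter:
        """Remove punctuation and symbols, then index every remaining
        character (1-gram) with its frequency."""
        return Counter(ch for ch in sentence if ch not in PUNCT)

    # Every character of an arbitrary example sentence becomes a feature.
    print(unigram_index("中文信息检索。"))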
`
`2.2. Bigram Indexing
`
`Another common and simple method for representation
`of Chinese is to use all consecutive overlapping 2-character
`pairs as indexing features (Chien, 1995; Liang, Lee, &
`
`710
`
`JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE—June 1999
`
`AOL Ex. 1025
`Page 2 of 15
`
`
`
Yang, 1996). Again, no dictionary, linguistic analysis, or stopword removal needs to be performed; only punctuation is deleted. The same topic title used in Section 2.1 now becomes a sequence of overlapping character pairs.
`
An n-character string generates a set of n − 1 bigrams. It has been estimated from a collection of a million Chinese words that 8.8% of word types are of length one, 67.8% of length two, 15.9% of length three, and 6.8% of length four (tabulated in He, 1983). These numbers are in line with those quoted in Chen, He, Xu, Gey, and Meggs (1997). Thus, over 99.3% of Chinese words are four characters or less in length. Since bigrams are exhaustively generated,
`they will cover all legitimate 2-character words in a text and
therefore one may expect that on average over 2/3 of the Chinese words in a piece of text will be correctly indexed. They are
`also much more specific in meaning than single characters.
`In this example, the 1st, 3rd, 4th, 7th, and 9th bigrams are
`considered correct words. The 2nd, 5th, and 6th are noise
`terms having no particular meaning, while the 8th is a
`legitimate word but with the unwanted meaning “fate.”
Thus, in this example, we may say that 5/9, or about 55%, of the terms are useful, while the remaining rather high proportion of 45% are not.
A word longer than two characters is indexed with
`multiple bigrams. If some of the bigrams within such words
`are rather characteristic of the word and occurring not too
`often in the collection, these bigrams will index the word
`well. The drawback with bigrams is that many meaningless
`character-pairs (such as those connecting two legitimate
`words) are also produced and it is not easy to eliminate them
except when they are either very rare (occurring fewer than four times) or very frequent, in which case they can be removed by frequency considerations. The remaining bigrams could lead to
`noisy matchings between queries and documents, adversely
`impacting retrieval effectiveness. Normally, the number of
`unique bigrams generated from a corpus is a few times
`larger than for words.
Also, Chinese words can truly be single characters. For example, simple nouns and verbs like ( ) in ( ) (amount of iron production), ( ) in ( ) (various provinces), and ( ) in ( ) (visit America). Bigram representation would not index them properly. Because of this,
`investigators have also considered mixing representation
methods in a document, such as using 1-grams and bigrams
`simultaneously to augment indexing (Ogawa & Iwasaki,
`1995).
`
`2.3. Short-Word Indexing
`
`In linguistic analysis, in which one tries to do language
`understanding, it is customary to isolate the most precise
`description of concepts and their relationships in a piece of
`text. These precise descriptions are often noun phrases. For
`example, the phrase “crude oil price rise” describes a pre-
`
`cise concept and should be regarded as a monolithic unit.
For IR purposes, using phrases alone as indexing terms is often too precise and would not be useful in general. One reason is that queries and documents often describe the
`same concept with phrases that are not exactly the same (for
`example, “increase of crude price,” “rise in oil price,” etc.),
`and there will be loss in their matching. While one may
`devise complicated partial matching techniques between
`query and document phrases, the simplest way to get around
`it is to use single words (and perhaps some adjacent 2-word
phrases) as indexing terms. Thus, the foregoing phrase may
`give index terms like: “crude,” “oil,” “price,” “rise,” “crude
`oil,” “oil price,” “price rise.” Although it is intuitively felt
`that single words are highly ambiguous, many years of
`English IR investigation shows that these seem to be at
`the appropriate level of preciseness, and surpass phrase
`representation methods for large-scale retrieval purposes
`(Strzalkowski, Lin, Wang, Guthrie, Leistensnider, Wilding,
Karlgren, Strazheim, & Perez-Carballo, 1997). One can add phrases to the single-term result, but certainly not replace it.
Based on these observations, we believe that for Chinese IR, words are also preferable to phrases as indexing terms. Phrases are also referred to as long-words or compound words, while short ones are referred to as short-words (Ogawa, Bessho, & Hirose, 1993; Fujii & Croft, 1993). As illustrated before, there is often disagreement over whether a string is a word or a phrase that can be broken down. We consider any meaningful two- or
`three-character string as a short-word. Proper nouns or
`idioms of four characters are also regarded as such. Thus,
given a character sequence representing a concept or a proper noun like ( ) (Peoples' Republic of China), we aim to segment and index it as its constituent short-words. What one needs is a short-word segmentation algorithm to isolate them in a string of text.
`We also believe that accurate segmentation may not be
`as important as for other linguistic processing because IR
`works in a more forgiving environment. In IR, consistency
`of indexing is as important as (if not more than) the accurate
`choice of meaningful index terms. For example, one could,
`although not advisable, use chemistry terms to tag certain
`physics concepts in documents so long as both the indexers
`and the user community agree to use these terms consis-
`tently. They may look erroneous, but these terms do serve as
`a code agreed upon by both parties for retrieval purposes. In
`a somewhat similar way, errors in word segmentation would
`lead to meaningless codes; but so long as they are consis-
`tent, one would get similar wrong codes from both the query
and the document if the same string is encountered, which would lead to matchings between them. A similar mechanism may be at work for bigrams as well when the true words are longer than two characters.
We have devised a simple algorithm aimed at segmenting texts into short-words one to three characters long
`that function like English content terms for indexing pur-
`poses. It does not require heavy initial investments like
`
`taggers or large dictionaries, and is reasonably fast for large
`corpora. The process is based on the following four steps,
`A–D:
`
`A) Facts—lookup on a list of legitimate short-words.
`Initially, we have manually created a 2,175-entry lexicon
`called L0. This is very small comparatively, consisting of
`commonly used words of one to three characters, with some
`proper nouns and idioms of size four. Each entry is tagged
`as 0 (useful: total 1,337), 1 (stopword: 671), s (symbol: 88),
`6 (numeric: 37), 4 (punctuation: 9), and 2 or 3 for the rules
`below. Other researchers have used lexicons of hundreds of
`thousands. We do not have such a large resource; besides,
`maintenance of such a list is not trivial. We try to remedy
`this via rules.
`
`Given an input string, we scan left to right and perform
`greedy matching for the longest pattern agreement on the
`lexicon. Any match will result in breaking a sentence into
`smaller fragments, which we call chunks, on both sides of
`the word. The matched term is processed according to its
`tag: either kept, if the tag is 0, or removed, if it is 1, s, 4, or
`6. Subsequently, we discovered that the presence of stop-
`words does not affect retrieval much, and have since then
`converted all words that are tagged 1 to tag 0. To keep
`processing efficient, we did not explore and select from all
`possible segmentation paths as done in Nie, Brisebois, and
`Ren (1996) and Ponte and Croft (1996).
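A minimal sketch of this Step A lookup, assuming a small in-memory lexicon that maps entries to their tags; the function names and the tiny example lexicon are ours, not taken from PIRCS.

    def lexicon_scan(sentence: str, lexicon: dict, max_len: int = 4):
        """Greedy left-to-right longest match against the lexicon.
        Returns (kept_words, chunks): matched entries tagged '0' are kept as
        index terms; the unmatched stretches between matches become chunks
        that are passed on to the rules of Step B."""
        kept, chunks, buf, i = [], [], [], 0
        while i < len(sentence):
            match = None
            for n in range(min(max_len, len(sentence) - i), 0, -1):
                if sentence[i:i + n] in lexicon:
                    match = sentence[i:i + n]
                    break
            if match is None:                 # no lexicon entry starts here
                buf.append(sentence[i])
                i += 1
                continue
            if buf:                           # any match closes the current chunk
                chunks.append("".join(buf))
                buf = []
            if lexicon[match] == "0":         # useful word: keep as an index term
                kept.append(match)
            # other tags (stopword, symbol, numeric, punctuation) just split the text
            i += len(match)
        if buf:
            chunks.append("".join(buf))
        return kept, chunks

    # Tiny illustrative lexicon (tags as in the text: "0" useful, "1" stopword).
    print(lexicon_scan("中国的核电站", {"中国": "0", "的": "1"}))   # (['中国'], ['核电站'])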
`
`B) Rules—for performing further segmentation on the
`remaining chunks. Words in any language are dynamic and
`one can never capture “all” Chinese words in a lexicon for
`segmentation purposes. We attempt to identify some com-
`mon language usage ad-hoc rules that can be employed to
`further split the chunks into short-words. The rules that we
`use, together with their rationale and examples and counter-
`examples are described below:
`
`Rule D (for double): any two adjacent identical charac-
`ters xx are considered short-words—this identifies double
`same characters that are often used as adjectives or adverbs
`and often do not carry much content (see ex. 1–2 below).
`Some Chinese given names do use double same characters
`(ex. 3) and we would erroneously separate them from the
family name. Other cases such as "U.S. Congress" (ex. 4) require splitting between the same two characters. In these cases we rely on "U.S." being in the lexicon and being identified first before applying this rule. As in Step A, we can regard these double characters as stopwords and remove them, or as short-words and keep them. In either case, it serves
`to segment the text at this position.
Examples of Rule D:
(1) daily
(2) everywhere
Counter-examples of Rule D:
(3) person name
(4) U.S. Congress
`
Rule 2: px, where p ∈ P, a small set of 31 special characters, is considered a short-word for any x—these characters are tagged "2" in our lexicon and examples are
`shown below (ex. 5– 8). When character p is tagged “2,” we
`also try to identify common words where p is used as a word
`in the construct yp, and these are entered into the lexicon. A
`string like . . . ypx . . . would be split “yp x” rather than
`“y px,” dictionary entries being of higher precedence. Ini-
`tially, these Rule 2 words were considered as stopwords, but
`later they were changed to short-words for the same reason
`mentioned earlier. This rule works in many cases, but there
`are also many counter-examples (such as ex. 9 –11).
Examples of Rule 2 (P = { }):
(5) a stick, one country
(6) (this, that) time
(7) together
(8) all walks of life
Counter-examples of Rule 2:
(9) unique
(10) this certainly
(11) oneself
`
Rule 3: xq, where q ∈ Q, a set of currently only 2 special characters, is a short-word for any x—these are tagged "3" and the rule is a complement to Rule 2 (see ex. 12–13 and counter-examples ex. 14–15).
Examples of Rule 3 (Q = { }):
(12) they, we
(13) more
Counter-examples of Rule 3:
(14) teachers
(15) tiny amount
`
Rule E (for even): any remaining sequence with an even number of characters is segmented two by two—this arises from the observation that 60–70% of Chinese words are 2 characters long, and that the rhythm of Chinese is often
`bi-syllable punctuated with mono-syllables and tri-sylla-
`bles. If one can identify where the single- or triple-character
`words occur, the rest of the string quite often can be split as
`such. Examples 16 and 17 below show chunks (from TREC
`topics) that are even, and are surrounded by punctuation
`signs or detected short-words. They will be segmented
`correctly. Examples 18 and 19 show counter-examples with
`an even number of characters that do not obey this rule.
This rule, together with Rule 2, is our primary method for
`discovering new short-words that occur in the collection but
`not in our initial lexicon.
`
`712
`
`JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE—June 1999
`
`AOL Ex. 1025
`Page 4 of 15
`
`
`
Examples of Rule E:
(16) most favored nation treatment
(17) separation, representatives discuss bilateral trade relationship
`
Counter-examples of Rule E:
(18) communist party member
(19) according to XinHua Agency report
`
`In addition, numeric entries are separated or removed as
`stopwords although one can often detect a sequence of them
`and have it identified as a number.
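The following sketch shows one possible way of applying Rules D, 2, 3, and E to a chunk left over from Step A. The character sets P and Q are hypothetical stand-ins (the paper's 31 Rule 2 characters and 2 Rule 3 characters are not reproduced here), and the left-to-right order of application is an assumption of the sketch.

    # Hypothetical stand-ins for the Rule 2 and Rule 3 character sets.
    P = set("个各")     # Rule 2: p followed by any character x
    Q = set("们")       # Rule 3: any character x followed by q

    def rule_e(run: str):
        """Rule E: split an even-length remaining run two characters at a time;
        an odd-length run is left whole here (one possible simplification)."""
        if run and len(run) % 2 == 0:
            return [run[j:j + 2] for j in range(0, len(run), 2)]
        return [run] if run else []

    def split_chunk(chunk: str):
        """Apply Rules D, 2, and 3 left to right, then Rule E to what remains."""
        words, run, i = [], "", 0
        while i < len(chunk):
            two = chunk[i:i + 2]
            hit = (len(two) == 2 and
                   (two[0] == two[1]       # Rule D: doubled character
                    or two[0] in P         # Rule 2
                    or two[1] in Q))       # Rule 3
            if hit:
                words.extend(rule_e(run)); run = ""
                words.append(two); i += 2
            else:
                run += chunk[i]; i += 1
        words.extend(rule_e(run))
        return words

    # A four-character chunk with no rule hits is split 2 + 2 by Rule E.
    print(split_chunk("双边贸易"))   # ['双边', '贸易']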
`
`C) Frequency Filter—after a first pass through the test
`corpus via steps A and B, a list of candidate short-words
`will be generated with their frequency of occurrence. A
`threshold is used to extract the most commonly occurring
`ones. These are the new short-words that are discovered
`from the corpus itself.
`D) Iteration— using the newly identified short-words of
`Step C, we expand our initial lexicon in step A and re-
`process the corpus. In theory, we could continue to iterate,
`but we have only done one round. With a frequency thresh-
`old value in Step C of 30, a final lexicon size of 15,234
`called L01 was obtained when the small 2,175-entry L0 was
used as the initial lexicon. We later expanded the initial lexicon to 27,147 entries, called L1, which was in turn expanded to 42,822 entries, called L11. We have used all four lists for retrieval.
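Steps C and D can be sketched as one counting-and-expansion pass; the function below is our illustration, taking the Step A and Step B operations as parameters, with the threshold of 30 being the value quoted above.

    from collections import Counter
    from typing import Callable, Dict, Iterable, List, Tuple

    def expand_lexicon(corpus: Iterable[str],
                       lexicon: Dict[str, str],
                       lookup: Callable[[str, Dict[str, str]], Tuple[List[str], List[str]]],
                       split_rules: Callable[[str], List[str]],
                       threshold: int = 30) -> Dict[str, str]:
        """Count the short-word candidates produced by the rules (Step B) on the
        chunks left over from lexicon lookup (Step A), keep those occurring at
        least `threshold` times (Step C), and return the expanded lexicon used
        for one re-segmentation pass (Step D)."""
        counts = Counter()
        for text in corpus:
            _, chunks = lookup(text, lexicon)          # Step A
            for chunk in chunks:
                counts.update(split_rules(chunk))      # Step B
        new_words = {w for w, f in counts.items()
                     if f >= threshold and w not in lexicon}     # Step C
        return dict(lexicon, **{w: "0" for w in new_words})      # Step D

With the small L0 lexicon as input, a pass of this kind is what produced the expanded L01 list described above.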
`
When our simple segmentation method is applied to the previous topic #CH3 title using the large L11 lexicon, it is split into a segmentation which is among the correct ones.
`We believe our method and the rules used in Step B,
`though simple, are useful. They naturally do not always
`work, but may work correctly often enough for IR purposes.
`Comparison with a manual short-word segmentation of the
`set of 28 TREC-5 queries shows that we achieve 91.3%
`recall and 83% precision on average. It is possible that these
`queries are easy to segment. Our method of segmentation is
`certainly too approximate for other applications such as
machine translation, text-to-speech, etc. For IR, which is a more forgiving environment where the purpose is to detect documents with a high probability of relevance rather than to match meaning exactly, it may be adequate.
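For reference, recall and precision of a produced segmentation against a manual one can be computed from the multiset overlap of the words; this small sketch is our own illustration, not the evaluation procedure behind the figures above.

    from collections import Counter

    def segmentation_quality(reference, produced):
        """Word-level recall and precision of a produced segmentation
        against a manual reference segmentation."""
        ref, out = Counter(reference), Counter(produced)
        overlap = sum((ref & out).values())
        recall = overlap / sum(ref.values()) if ref else 0.0
        precision = overlap / sum(out.values()) if out else 0.0
        return recall, precision

    # Arbitrary example: only one of three reference words is recovered.
    print(segmentation_quality(["双边", "贸易", "关系"], ["双边", "贸易关", "系"]))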
To further guard against wrong segmentation and therefore loss in query-document matching, we additionally augment our indexing with 1-grams generated from all short-words, in effect mixing short-word representation with 1-grams.
`
`3. The Retrieval Environment
`
`3.1. PIRCS Retrieval System
`
`For retrieval, we use our PIRCS engine that has been
`documented elsewhere (Kwok, 1990, 1995) and has partic-
`ipated in all the past TREC experiments with consistently
`admirable results (see our papers in previous TREC pro-
`ceedings). PIRCS employs an automatic, learning-based
`approach that is conceptualized as a 3-layer (query, term,
`and document) network with bi-directional,
`inter-layer
`weighted edge connections.
`It operates via activation
`spreading and combines probabilistic indexing and retrieval
`methods that can account for local as well as global term
usage statistics. For each query (q_a) and document (d_i) pair,
`the basic model evaluates a retrieval status value (RSV) as
`a combination of a query-focused process that spreads ac-
`tivation from document to query through common terms k
giving RSV_query that simulates probabilistic retrieval, and an
`analogous document-focused process operating vice versa
giving RSV_doc that simulates probabilistic indexing, as fol-
`lows:
`
RSV = α · RSV_doc + (1 − α) · RSV_query
    = α · Σ_k S(q_ak/L_a) · w_ik + (1 − α) · Σ_k S(d_ik/L_i) · w_ak

where 0 < α < 1 is a constant, q_ak and d_ik are the frequencies of term k in the query and the document respectively, L_a and L_i are the query and document lengths, and S(.) is a sigmoid-like function to suppress outlying values. For example, in the query-focused RSV_query, w_ak is defined as log[r·(1 − s)/((1 − r)·s)] (where r = Pr(term k present | relevant to q_a) and s = Pr(term k present | not relevant to q_a)), and is the log of the likelihood ratio of term k assuming term independence. The weight S(d_ik/L_i) is
`a measure of the contribution of the conceptual component
term k in document i. The starting point of this formula is to evaluate a log odds value log[Pr(relevant to q_a | document i)/Pr(not relevant to q_a | document i)] for ranking document i with respect to query a; this log odds value is transformed by Bayes' Theorem, assuming items are constituted of conceptual term components that are independent, and some other
`assumptions (Kwok, 1990). The RSV value attached to each
`document is then sorted to produce a ranked retrieval list for
`the user who poses the query.
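A minimal sketch of how this RSV combination can be computed for one query-document pair; the squashing function S and the value of α below are stand-ins, since their exact forms are not specified here.

    import math

    def S(x: float) -> float:
        """A sigmoid-like squashing function to suppress outlying values
        (a stand-in; not the exact function used by PIRCS)."""
        return 1.0 - math.exp(-x)

    def rsv(query_tf, doc_tf, w_query, w_doc, alpha=0.3):
        """RSV = alpha * RSV_doc + (1 - alpha) * RSV_query over terms k common
        to query a and document i.
        query_tf, doc_tf: term -> raw frequency (q_ak, d_ik)
        w_query, w_doc:   term -> weights w_ak and w_ik."""
        L_a = sum(query_tf.values()) or 1
        L_i = sum(doc_tf.values()) or 1
        common = query_tf.keys() & doc_tf.keys()
        rsv_doc = sum(S(query_tf[k] / L_a) * w_doc.get(k, 0.0) for k in common)
        rsv_query = sum(S(doc_tf[k] / L_i) * w_query.get(k, 0.0) for k in common)
        return alpha * rsv_doc + (1 - alpha) * rsv_query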
`A major difference of our model from other similar
`probabilistic approaches is to treat a document or query as
`non-monolithic, but constituted of conceptual components
`(which we approximate as terms). This leads us to formulate
`in a collection of components rather than documents, and
`allows us to account for the non-binary occurrence of terms
in items in a natural way. In the weighting formula for term k, w_ak = log[r·(1 − s)/((1 − r)·s)] defined earlier, r is set to a query self-learn value of q_ak/L_a during initial retrieval, when no relevance information is available except that the query is self-relevant, and s is set to F_k/M: the collection term frequency of k (F_k) divided by the total number of terms M used in the collection. We call the factor log[(1 − s)/s] the inverse collection term frequency (ICTF) factor; it differs from the usual inverse document frequency (IDF), which counts the document frequency of a term against whole documents. Moreover, as the
`system learns from relevant documents, r can be trained to
`any value intermediate between the initial self-learn value
`and that given by the known relevants according to a learn-
`ing procedure (Kwok, 1995). Our system also deals with
`long documents and short queries in special ways. Long
`documents may cover widely different topics and only a
small part of them may talk about the content of a query under
`search. If the document is treated as a whole, local proba-
`bility estimates of terms within the document may not be
`correct. We handle such documents by segmenting them
`into approximately equal length sub-documents. In a re-
`trieval, multiple sub-documents of the same document may
`rank high. For the final ranking, retrieval status values
`(RSV) of the top three sub-documents of the same docu-
`ment are combined with decreasing weights to return a final
`RSV. This in effect favors retrieval of longer documents
`that contain positive evidence in different sub-parts of it.
`With short queries, term importance within them cannot be
distinguished by weighting. We borrow term statistics from the whole collection to weight these query terms, a scheme called avtf (average term frequency) weighting (Kwok, 1996).
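The initial self-learn weight follows directly from these definitions (r = q_ak/L_a, s = F_k/M); the sketch below is written from the definitions only, with illustrative numbers.

    import math

    def initial_query_term_weight(q_ak: int, L_a: int, F_k: int, M: int) -> float:
        """w_ak = log[ r(1 - s) / ((1 - r) s) ], with the self-learn estimate
        r = q_ak / L_a and the collection estimate s = F_k / M.
        The log[(1 - s)/s] part is the inverse collection term frequency (ICTF)."""
        r = q_ak / L_a
        s = F_k / M
        return math.log((r * (1.0 - s)) / ((1.0 - r) * s))

    # A term occurring twice in a 30-term query and 5,000 times among 20 million
    # collection term occurrences (illustrative numbers only).
    print(initial_query_term_weight(2, 30, 5_000, 20_000_000))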
`The adaptive capability of our network (Kwok, 1995) is
`materialized as a two-stage ad-hoc retrieval strategy. The
`first is an initial retrieval where a raw user query is em-
`ployed directly. The d best-ranked sub-documents from this
`retrieval are then regarded as relevant without user judg-
`ment, and employed as feedback data to train the initial
`query term weights and to add new terms to the query—a
`process of query expansion. This procedure has been called
`
`pseudo-feedback because there is no user judgment of doc-
`uments involved. This modified query retrieval then pro-
vides the final result. This second-stage retrieval can in general provide substantially better results than the initial one if the initial retrieval is reasonable and includes some relevant documents within the d best-ranked sub-documents. The process is
`like having a dynamic thesaurus bringing in synonymous or
`related terms to enrich the raw query (Evans & Lefferts,
`1994). PIRCS brings these methodologies together and pro-
`vides a powerful retrieval engine for experimentation.
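The two-stage strategy reduces to a short pseudo-feedback loop; in this sketch the retrieval, weight-training, and query-expansion operations are placeholders for the PIRCS components described above, and the values of d and the number of expansion terms are illustrative.

    from typing import Callable, List

    def two_stage_retrieval(query,
                            collection,
                            retrieve: Callable,
                            train_weights: Callable,
                            expand_query: Callable,
                            d: int = 20,
                            expand_terms: int = 40) -> List:
        """Pseudo-feedback: the top-d sub-documents of an initial retrieval are
        treated as relevant without user judgment, used to re-train the query
        term weights and to expand the query, and the modified query is run again."""
        ranked = retrieve(query, collection)          # stage 1: raw query
        feedback = ranked[:d]                         # assumed-relevant sub-documents
        query = train_weights(query, feedback)        # adjust initial term weights
        query = expand_query(query, feedback, expand_terms)   # add related terms
        return retrieve(query, collection)            # stage 2: final ranking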
`
`3.2. Statistics of Collections, Queries, and Terms
`
`Our investigation was based on the TREC-5 & 6 Chinese
`collection of 24,988 XinHua and 139,801 People’s Daily
`news articles totaling about 170 MB. These are probably the
largest evaluated text collections available for Chinese-language IR to date. To guard against very long articles, which may lead to difficulties in inter-document comparisons, they are divided into sub-documents of about 475 characters ending on a paragraph boundary. (For bigrams, sub-documents of about 625 characters are preferable.) An example document from XinHua is shown in the
`Appendix. This produces a total of 247,685 (218,957 for
`bigrams) sub-documents which are then segmented into
`characters, bigrams, and short-words as described in Section
`2. For short-word representation, single characters from
`each word having a length of two or greater are also in-
`cluded for indexing purposes to minimize the problems of
`wrong segmentation. For bigrams, we show both with and
`without mixing characters.
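A sketch of this sub-document splitting: articles are cut into pieces of roughly the target size, always closing on a paragraph boundary; the 475-character target is the figure quoted above, and newline-delimited paragraphs are an assumption of the sketch.

    def split_into_subdocs(article: str, target: int = 475):
        """Cut an article into sub-documents of roughly `target` characters,
        each ending on a paragraph boundary."""
        paragraphs = [p for p in article.split("\n") if p.strip()]
        subdocs, current = [], ""
        for para in paragraphs:
            current += para + "\n"
            if len(current) >= target:     # close the sub-document at this boundary
                subdocs.append(current)
                current = ""
        if current:
            subdocs.append(current)
        return subdocs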
`Provided with the TREC-5 & 6 experiments are two sets
`of 28 and 26 very long and rich Chinese topics, mostly on
`current affairs. For ease of comparison with previous TREC
`experiments, we continue to perform investigations using
`these two sets separately. One can always combine their
`performance measures into one by using the ratio 28:26. An
`example topic #CH19 is shown in the Appendix. They are
`processed like documents and form queries. These topics
`representing user needs have also been manually judged
`with respect to the (most fruitful part of the) collection at the
National Institute of Standards and Technology (NIST) so that a set
`of relevant documents for each query is known. This allows
`retrieval results to be evaluated against known answers.
`Table 1 shows the various statistics generated by the col-
`lections and queries after our indexing procedures. It can be
`seen that bigram with characters and bigram indexing are
`the most space-consuming. The lengths of queries and doc-
`uments are obtained after thresholding with frequencies 3
`and 20K for all except character representation which uses
`3 and 29K (60K for short queries).
`
`3.3. Measures of Retrieval Effectiveness
`
In TREC, it is customary to show the effectiveness of a retrieval using precision tables at different points, averaged over the set of queries under investigation, such as shown in Table 2.
`
`714
`
`JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE—June 1999
`
`AOL Ex. 1025
`Page 6 of 15
`
`
`
TABLE 1. Statistics generated from the collections and queries.

Columns: 1-gram (c) | Bigram (bi) | Bigram & 1-gram (bi.c) | Short-Word & 1-gram (sw.c)

INDEX TERMS, Raw Text
  Unique Types:  8,093 (incl. some English) | 1,461,551 | 1,468,623 | 410,212
  Tokens:        65,209,106 | 58,463,482 | 123,338,700 | 90,209,363
INDEX TERMS, After Thresholding
  Unique Types:  5,890 | 562,969 | 568,212 | 102,835
  Tokens:        21,330,969 | 42,690,777 | 50,463,006 | 23,183,539
DOCUMENTS
  #Sub-Docs:             247,685 | 218,957 | 218,957 | 247,685
  Doc Length (average):  86.1 | 192.7 | 230.5 | 93.9
QUERIES (unique terms)
  TREC5 (28 queries), Long Query:   19.3 | 74.2 | 78.25 | 35.9
  TREC5 (28 queries), Short Query:  7.00 | 11.1 | 12.9 | 7.0
  TREC6 (26 queries), Long Query:   20.2 | 65.2 | 80.88 | 37.7
  TREC6 (26 queries), Short Query:  6.0 | 10.3 | 12.2 | 6.0
`
Precision is defined as the proportion of retrieved
`documents that are relevant, and recall the proportion of
`relevant documents which are retrieved. In general when
`more documents are retrieved, precision falls as recall in-
`creases. Many summarizing measures shown in Table 2 can
be used, such as the number of relevant documents retrieved (Rel-ret), which means relevant documents (as judged by NIST manual evaluation) that are retrieved with a rank order at or before 1,000; this should be compared with all the relevants found, listed in the row (Relevants). The
`1,000 threshold is set arbitrarily by NIST for comparison
`purposes. This would be a suitable measure for people
`primarily inte