Employing Multiple Representations for Chinese Information Retrieval

K.L. Kwok
Computer Science Department, Queens College, City University of New York, Flushing, NY 11367.
E-mail: kwok@ir.cs.qc.edu; URL: http://ir.cs.qc.edu

Some results in this paper have been published in previous conference proceedings; see Kwok (1997a, b). Received April 24, 1998; revised November 25, 1998; accepted February 9, 1999. © 1999 John Wiley & Sons, Inc. Journal of the American Society for Information Science, 50(8):709-723, 1999.

For information retrieval in the Chinese language, three representation methods for texts are popular: 1-gram (character), bigram, and short-word. Each has its advantages as well as drawbacks. Employing more than one method may combine their advantages and enhance retrieval effectiveness. We investigated two ways of using them simultaneously: mixing representations in documents and queries, and combining retrieval lists obtained via different representations. The experiments were done with the 170 MB evaluated Chinese corpora and 54 long and short queries available from the TREC program, using our Probabilistic Indexing and Retrieval Components System (PIRCS). Experiments show that good retrieval need not depend on accurate word segmentation; approximate segmentation into short-words will do. Results also confirm that bigram representation alone works well; mixing characters with bigram representation boosts effectiveness further, but it is preferable to mix characters with short-word indexing, which is more efficient, needs fewer resources, and gives better retrieval more often. Combining retrieval lists from short-word-with-character representation and from bigram indexing provides the best retrieval results, but at a substantial cost.

1. Introduction

It is well known that a sentence in Chinese (and some other agglutinative Asian languages) consists of a continuous string of "characters" without delimiting white spaces to identify words. In Chinese, the characters are called ideographs. This makes it difficult to do computational studies on these languages, since isolated words are needed for many purposes, such as linguistic analysis, machine translation, etc. Automatic methods for isolating words in a sentence—a process called word segmentation—are therefore an important and necessary first step before other analysis can begin. Chinese words can be any number of characters in length, although the majority of common words are four characters or less (except for some proper nouns). Longer words are most probably phrases, but the boundaries between words and phrases are not clear. Many researchers have proposed practical methods for word segmentation (Chen & Liu, 1992; Wu & Tseng, 1995; Nie, Brisebois, & Ren, 1996; Jin & Chen, 1996; Ponte & Croft, 1996; Sproat, Shih, Gale, & Chang, 1996; Sun, Shen, & Huang, 1997).

Information retrieval (IR) deals with the task of selecting relevant documents for a user need that is expressed in free text. The document collection is usually huge, gigabytes in size, and both queries and documents are domain-unrestricted and unpredictable. When one does IR in the Chinese language, with its peculiar property, one would expect that accurate word segmentation is also a crucial first step before other processing can begin.

However, in the recent TREC-5 and TREC-6 workshops (Text REtrieval Conferences, sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA)), where fairly large scale, blind Chinese retrieval experiments were performed (Kwok & Grunfeld, 1997; Kwok, Grunfeld, & Xu, 1998), we have demonstrated that an approximate short-word segmentation method, coupled with a powerful retrieval algorithm, is sufficient to provide very good results. Moreover, experiments by us as well as by others using the even simpler bigram representation of text (i.e., all consecutive overlapping two-character pairs), both within and outside the TREC environment, also produce comparable results (Chien, 1995; Liang, Lee, & Yang, 1996; Chen, He, Xu, Gey, & Meggs, 1997; Kwok, 1997a). This is a bit counter-intuitive, because the bigram method leads to an indexing feature space over three times as large as that of our segmentation (approximately 1.5 million vs. 0.4
million), and one would expect many random, non-content matchings between queries and documents that may adversely affect the outcome. Apparently, this is not so. Even 1-gram (single character) representation can return lesser, but still quite respectable, retrieval results (Kwok, 1997a). Since IR is a difficult task, it may be helpful to combine these representations to enhance retrieval, since each representation type can have merits and strengths of its own.

Employing multiple representations to improve retrieval is often done in English IR. Katzer et al. (1982) have shown that different document representations can lead to different sets of retrieved documents, even though the average precision and recall measures can be superficially similar. Adding simple phrases (such as 2-word adjacency phrases) to single terms for improved item representation is quite standard in many systems (e.g., see Buckley, Salton, & Allan, 1993), including ours as it participated in the TREC experiments. Investigators have also studied different representations of the same query need, such as multiple Boolean formulations (e.g., see Belkin et al., 1994) or Boolean and natural-language formulations (e.g., see Kwok, Grunfeld, & Lewis, 1994), and either combine the queries or combine the retrieval lists in an effort to enhance retrieval. The INQUERY system (Croft, Callan, & Broglio, 1994) in particular has explicit built-in facilities to combine beliefs from different representations of the same query need that have been propagated from documents, which can themselves have different representations. Examples of employing both long- and short-word representations for IR in Asian languages include the approaches of Ogawa and Matsuda (1997), Chien (1997), and Fujii and Croft (1993).

This study presents investigations into using combinations of representations for Chinese IR and reports some results. Section 2 discusses the different representation types and their value for concept indexing; in particular, Section 2.3 discusses our method of short-word segmentation. Section 3 presents a review of our PIRCS retrieval system, the collection statistics, and the evaluation measures used. Section 4 discusses results of our experiments using representations singly and mixed, as well as for retrieval-list combination. Section 5 provides a cost-and-benefit discussion of combining representations. Section 6 contains observations and conclusions.

2. Simple Representation Types for Chinese Retrieval

2.1. 1-gram (Single Character) Indexing

1-gram indexing means, for each sentence, removing punctuation marks and symbols and separating the remaining string into single characters. Every character is kept for representation purposes. Neither a dictionary of legal terms nor linguistic analysis is required, and it is the simplest indexing method. Consider the title section of the TREC-5 Chinese topic #CH3 (the Chinese characters are not reproduced in this copy), which means: Chinese (…) nuclear power plant (…)'s operating (…) condition (…) when it is segmented by words. However, there is no one "correct" segmentation: it is also possible to regard (…) or (…) as words, or even single characters as words, and to argue that (…) is a phrase, like its English counterpart. When one uses character representation, the title becomes a sequence of single characters.

Single characters can be highly ambiguous in meaning depending on the context, and matching queries with documents on characters only could adversely affect precision. For example, (…) can mean "center," and only because it is in juxtaposition with (…) do the two become one word, which is an abbreviation of the People's Republic of China. (…) can mean the verb "stand," or a "stopping place" as in (…) (bus stop), or, in this case, "plant" or "station," because it is next to (…) (electricity). (…) can be a word by itself meaning "pit" (of a fruit), as in (…) (seedless grapes). Here it is next to (…), and one deduces that it should be an abbreviation of the two-character word (…) meaning "nuclear." Thus, as far as attempting to recover the meaning of the string is concerned, 1-gram segmentation is a poor representation.

On the other hand, as index terms for IR, 1-grams can have value. This is because many short-words often belong to the same topical area as one of the characters forming the word. For example, words like (…) (battery), (…) (lightning), (…) (electric lamp), and (…) (motor-driven) all share the character (…) meaning (electricity, electric), which may characterize meaning at a higher level of the hierarchy from which these words are derived. Characters probably have similar (perhaps less) precision as English stems do for indexing. When they are used, it is guaranteed that if there are correct word matches between queries and documents there will also be 1-gram matches, which should be good for recall.

The set of commonly used Chinese characters is small: the GuoBiao (GB-2312) encoding scheme defines only 6,763. For the indexing of general newspaper articles, this should be compared to something like a couple of hundred thousand English indexing terms for collections of a few hundred megabytes. We would therefore expect each character's use to spread over many more documents, and their Zipfian curve to be comparatively flatter than that of English indexing terms. This may translate into higher imprecision. Nevertheless, retrieval by 1-gram may provide a basis from which one can measure improvements by other representation methods.
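
As a concrete, minimal sketch of this baseline, 1-gram indexing amounts to little more than the following. This is our illustration, not code from the paper; the character class used to drop punctuation and symbols, and the sample sentence, are assumptions:

```python
import re
from collections import Counter

def unigram_index(sentence: str) -> Counter:
    """1-gram indexing: delete punctuation/symbols, keep every remaining character."""
    # Keep only CJK ideographs (an assumed, simplified character class).
    chars = re.findall(r"[\u4e00-\u9fff]", sentence)
    return Counter(chars)

# Hypothetical usage on an arbitrary sentence:
print(unigram_index("今天天气很好。"))  # counts each character, dropping the period
```
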

2.2. Bigram Indexing

Another common and simple method for representing Chinese is to use all consecutive overlapping 2-character pairs as indexing features (Chien, 1995; Liang, Lee, & Yang, 1996). Again, no dictionary, linguistic analysis, or stopword removal need be performed; only punctuation is deleted. The same topic title used in Section 2.1 now becomes: (…)

An n-character string generates n - 1 bigrams. It has been estimated from a collection of a million Chinese words that 8.8% of word types are of length one, 67.8% of length two, 15.9% of length three, and 6.8% of length four (tabulated in He, 1983). These numbers are in line with those quoted in Chen, He, Xu, Gey, and Meggs (1997). Thus, about 99.3% of Chinese words are four characters or less in length. Since bigrams are exhaustively generated, they will cover all legitimate 2-character words in a text, and one may therefore expect that, on average, over 2/3 of the Chinese words in a piece of text will be correctly indexed. Bigrams are also much more specific in meaning than single characters. In this example, the 1st, 3rd, 4th, 7th, and 9th bigrams are considered correct words. The 2nd, 5th, and 6th are noise terms having no particular meaning, while the 8th is a legitimate word but with the unwanted meaning "fate." Thus, in this example, we may say that 5/9, or about 55%, of the terms are useful, while the rather high remainder of 45% are not.

A word longer than two characters is indexed with multiple bigrams. If some of the bigrams within such a word are rather characteristic of it and do not occur too often in the collection, these bigrams will index the word well. The drawback of bigrams is that many meaningless character pairs (such as those connecting two legitimate words) are also produced, and it is not easy to eliminate them unless they occur either very rarely (fewer than four times) or very often, in which case they can be removed by frequency considerations. The remaining bigrams can lead to noisy matchings between queries and documents, adversely impacting retrieval effectiveness. Normally, the number of unique bigrams generated from a corpus is a few times larger than the number of words.

Also, Chinese words can be truly single-character. Examples are simple nouns and verbs like (…) in (…) (amount of iron production), (…) in (…) (various provinces), and (…) in (…) (visit America). Bigram representation would not index them properly. Because of this, investigators have also considered mixing representation methods in a document, such as using 1-grams and bigrams simultaneously to augment indexing (Ogawa & Iwasaki, 1995).
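
A minimal sketch of bigram generation (our illustration, not the paper's code); mixing in 1-grams as in Ogawa and Iwasaki (1995) would simply add the character list to the term list:

```python
def bigram_index(chars: str) -> list[str]:
    """All consecutive overlapping character pairs: n characters yield n - 1 bigrams."""
    return [chars[i:i + 2] for i in range(len(chars) - 1)]

# ASCII letters stand in for Chinese characters in this toy check:
assert bigram_index("ABCD") == ["AB", "BC", "CD"]  # 4 characters -> 3 bigrams
```
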

2.3. Short-Word Indexing

In linguistic analysis, where one tries to do language understanding, it is customary to isolate the most precise descriptions of concepts and their relationships in a piece of text. These precise descriptions are often noun phrases. For example, the phrase "crude oil price rise" describes a precise concept and should be regarded as a monolithic unit. For IR purposes, using only phrases as indexing terms is often too precise and would not be useful in general. One reason is that queries and documents often describe the same concept with phrases that are not exactly the same (for example, "increase of crude price," "rise in oil price," etc.), and there will be loss in their matching. While one may devise complicated partial-matching techniques between query and document phrases, the simplest way around this is to use single words (and perhaps some adjacent 2-word phrases) as indexing terms. Thus, the foregoing phrase may give index terms like "crude," "oil," "price," "rise," "crude oil," "oil price," and "price rise." Although it is intuitively felt that single words are highly ambiguous, many years of English IR investigation show that they seem to be at the appropriate level of preciseness, and surpass phrase representation methods for large-scale retrieval purposes (Strzalkowski, Lin, Wang, Guthrie, Leistensnider, Wilding, Karlgren, Strazheim, & Perez-Carballo, 1997). One can add phrases to the single-term result, but certainly not replace it.

Based on these observations, we believe that for Chinese IR, words would likewise be preferable to phrases as indexing terms. Phrases are also referred to as long-words or compound words, while short ones are referred to as short-words (Ogawa, Bessho, & Hirose, 1993; Fujii & Croft, 1993). As illustrated before, there is often disagreement over whether a string is a word or a phrase that can be broken down. We consider any meaningful two- or three-character string a short-word. Proper nouns and idioms of four characters are also regarded as such. Thus, given a character sequence representing a concept or a proper noun like (…) (People's Republic of China), we aim to segment and index it as a sequence of such short-words. What one needs is a short-word segmentation algorithm to isolate them in a string of text.

We also believe that accurate segmentation may not be as important here as for other linguistic processing, because IR works in a more forgiving environment. In IR, consistency of indexing is as important as (if not more important than) the accurate choice of meaningful index terms. For example, one could, although it is not advisable, use chemistry terms to tag certain physics concepts in documents, so long as both the indexers and the user community agree to use these terms consistently. They may look erroneous, but these terms serve as a code agreed upon by both parties for retrieval purposes. In a somewhat similar way, errors in word segmentation lead to meaningless codes; but so long as they are consistent, one gets the same wrong codes from both the query and the document whenever the same string is encountered, and this still leads to matchings between them. A similar mechanism may be at work for bigrams as well, when the true words are longer than two characters.

We have devised a simple algorithm that aims to segment texts into short-words of one to three characters that function like English content terms for indexing purposes. It does not require heavy initial investments like
taggers or large dictionaries, and it is reasonably fast for large corpora. The process is based on the following four steps, A–D:

A) Facts—lookup on a list of legitimate short-words. Initially, we manually created a 2,175-entry lexicon called L0. This is comparatively very small, consisting of commonly used words of one to three characters, with some proper nouns and idioms of size four. Each entry is tagged as 0 (useful: total 1,337), 1 (stopword: 671), s (symbol: 88), 6 (numeric: 37), 4 (punctuation: 9), or 2 or 3 for the rules below. Other researchers have used lexicons of hundreds of thousands of entries. We do not have such a large resource; besides, maintenance of such a list is not trivial. We try to remedy this via rules.

Given an input string, we scan left to right and perform greedy matching for the longest pattern agreement against the lexicon. Any match breaks a sentence into smaller fragments, which we call chunks, on both sides of the matched word. The matched term is processed according to its tag: it is kept if the tag is 0, or removed if it is 1, s, 4, or 6. Subsequently, we discovered that the presence of stopwords does not affect retrieval much, and have since converted all words tagged 1 to tag 0. To keep processing efficient, we did not explore and select from all possible segmentation paths as done in Nie, Brisebois, and Ren (1996) and Ponte and Croft (1996).

B) Rules—for performing further segmentation on the remaining chunks. Words in any language are dynamic, and one can never capture "all" Chinese words in a lexicon for segmentation purposes. We attempt to identify some common ad hoc language-usage rules that can be employed to split the chunks further into short-words. The rules we use, together with their rationale, examples, and counter-examples, are described below.

Rule D (for double): any two adjacent identical characters xx are considered a short-word. This identifies doubled characters, which are often used as adjectives or adverbs and often do not carry much content (see ex. 1–2 below). Some Chinese given names do use doubled characters (ex. 3), and we would erroneously separate them from the family name. Other cases, such as "U.S. Congress" (ex. 4), require splitting between the two identical characters; in these cases we rely on "U.S." being in the lexicon and being identified before this rule is applied. As in Step A, these doubled characters can be regarded as stopwords and removed, or as short-words and kept. In either case, the rule serves to segment the text at this position.

Examples of Rule D:
(1) (…) daily
(2) (…) everywhere

Counter-examples of Rule D:
(3) (…) a person's name
(4) (…) U.S. Congress

Rule 2: px, where p ∈ P, a small set of 31 special characters, is considered a short-word for any x. These characters are tagged "2" in our lexicon, and examples are shown below (ex. 5–8). When a character p is tagged "2," we also try to identify common words where p is used as a word in the construct yp, and these are entered into the lexicon. A string like . . . ypx . . . would then be split as "yp x" rather than "y px," dictionary entries taking higher precedence. Initially, these Rule 2 words were considered stopwords, but later they were changed to short-words for the same reason mentioned earlier. This rule works in many cases, but there are also many counter-examples (such as ex. 9–11).

P = { (…) }

Examples of Rule 2:
(5) (…) a stick, one country
(6) (…) (this, that) time
(7) (…) together
(8) (…) all walks of life

Counter-examples of Rule 2:
(9) (…) unique
(10) (…) this certainly
(11) (…) oneself

Rule 3: xq, where q ∈ Q, currently a set of only 2 special characters, is a short-word for any x. These characters are tagged "3"; the rule is a complement to Rule 2 (see ex. 12–13 and counter-examples 14–15).

Q = { (…) }

Examples of Rule 3:
(12) (…) they, we
(13) (…) more

Counter-examples of Rule 3:
(14) (…) teachers
(15) (…) tiny amount

Rule E (for even): any remaining sequence of an even number of characters is segmented two by two. This arises from the observation that 60–70% of Chinese words are 2 characters long, and that the rhythm of Chinese is often bi-syllabic, punctuated with mono-syllables and tri-syllables. If one can identify where the single- or triple-character words occur, the rest of the string can quite often be split this way. Examples 16 and 17 below show chunks (from TREC topics) that are even in length and are surrounded by punctuation signs or detected short-words; they will be segmented correctly. Examples 18 and 19 show counter-examples with an even number of characters that do not obey this rule. This rule, together with Rule 2, is our primary method for discovering new short-words that occur in the collection but not in our initial lexicon.

Examples of Rule E:
(16) (…) most | favored nation | treatment | separation
(17) (…) representatives discuss bilateral trade relationship

Counter-examples of Rule E:
(18) (…) communist party | member
(19) (…) according to XinHua Agency | report

In addition, numeric entries are separated out or removed as stopwords, although one can often detect a sequence of them and identify the sequence as a number.

C) Frequency filter—after a first pass through the test corpus via Steps A and B, a list of candidate short-words is generated together with their frequencies of occurrence. A threshold is used to extract the most commonly occurring ones. These are the new short-words discovered from the corpus itself.

D) Iteration—using the newly identified short-words of Step C, we expand the initial lexicon of Step A and reprocess the corpus. In theory, we could continue to iterate, but we have done only one round. With a frequency threshold of 30 in Step C, a final lexicon of 15,234 entries, called L01, was obtained when the small 2,175-entry L0 was used as the initial lexicon. We later expanded the initial lexicon to 27,147 entries, called L1, which grew to 42,822 entries, called L11. We have used all four lists for retrieval.
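
To make Steps A and B concrete, here is a minimal Python sketch of the greedy longest-match lookup with Rules D and E applied to the leftover chunks. It is our illustration, not the paper's code: the two-entry lexicon is a placeholder for the 2,175-entry L0, tag handling and Rules 2 and 3 are omitted, and the treatment of text preceding a doubled pair is simplified.

```python
# A minimal sketch of Steps A-B. The tiny lexicon and the maximum word
# length are placeholders; the real L0 has 2,175 tagged entries.
LEXICON = {"美国": "0", "中国": "0"}   # word -> tag ("0" = useful)
MAX_LEN = 4                            # longest lexicon entry

def rule_d_and_e(chunk: str) -> list[str]:
    """Rule D: split off doubled characters xx; Rule E: split an even-length remainder 2-by-2."""
    words, i = [], 0
    while i < len(chunk) - 1:
        if chunk[i] == chunk[i + 1]:          # Rule D fires
            words.append(chunk[:i])           # prefix kept whole (simplification)
            words.append(chunk[i:i + 2])
            chunk, i = chunk[i + 2:], 0
        else:
            i += 1
    if len(chunk) % 2 == 0:                   # Rule E on the even remainder
        words.extend(chunk[j:j + 2] for j in range(0, len(chunk), 2))
    else:
        words.append(chunk)                   # odd-length chunk left whole here
    return [w for w in words if w]

def segment(text: str) -> list[str]:
    """Step A: greedy longest match on the lexicon; unmatched spans become chunks for Step B."""
    words, chunk, i = [], "", 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 1, -1):
            if text[i:i + n] in LEXICON:      # longest match wins
                words.extend(rule_d_and_e(chunk)); chunk = ""
                words.append(text[i:i + n]); i += n
                break
        else:                                 # no lexicon match at position i
            chunk += text[i]; i += 1
    words.extend(rule_d_and_e(chunk))
    return words
```
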
When our simple segmentation method is applied to the previous topic #CH3 title using the large L11 lexicon, it is split into (…), which is among the correct segmentations.

We believe our method and the rules used in Step B, though simple, are useful. They naturally do not always work, but may work correctly often enough for IR purposes. Comparison with a manual short-word segmentation of the set of 28 TREC-5 queries shows that we achieve 91.3% recall and 83% precision on average. It is possible that these queries are easy to segment. Our method of segmentation is certainly too approximate for other applications such as machine translation, text-to-speech, etc. For IR, which has the more forgiving purpose of detecting documents with a high probability of relevance rather than exactly matching meaning, it may be adequate.

To further guard against wrong segmentation, and therefore loss in query-document matching, we additionally augment our indexing with 1-grams generated from all short-words, in effect mixing short-word representation with 1-grams.

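That augmentation step is simple to state in code; the sketch below is our illustration of it, keeping each short-word and also emitting its constituent characters:

```python
def augment_with_unigrams(short_words: list[str]) -> list[str]:
    """Mix representations: keep each short-word and also emit its single
    characters (for words of length >= 2) to guard against wrong segmentation."""
    terms = []
    for w in short_words:
        terms.append(w)
        if len(w) >= 2:
            terms.extend(w)        # each character becomes a 1-gram term
    return terms
```
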
3. The Retrieval Environment

3.1. PIRCS Retrieval System

For retrieval, we use our PIRCS engine, which has been documented elsewhere (Kwok, 1990, 1995) and has participated in all the past TREC experiments with consistently admirable results (see our papers in previous TREC proceedings). PIRCS employs an automatic, learning-based approach that is conceptualized as a 3-layer (query, term, and document) network with bi-directional, inter-layer weighted-edge connections. It operates via activation spreading and combines probabilistic indexing and retrieval methods that can account for local as well as global term-usage statistics. For each query ($q_a$) and document ($d_i$) pair, the basic model evaluates a retrieval status value (RSV) as a combination of a query-focused process, which spreads activation from document to query through common terms $k$, giving $\mathrm{RSV}_{query}$ and simulating probabilistic retrieval, and an analogous document-focused process operating vice versa, giving $\mathrm{RSV}_{doc}$ and simulating probabilistic indexing, as follows:

$$\mathrm{RSV} = \alpha\,\mathrm{RSV}_{doc} + (1-\alpha)\,\mathrm{RSV}_{query} = \alpha \sum_{k} S(q_{ak}/L_{a})\,w_{ik} + (1-\alpha) \sum_{k} S(d_{ik}/L_{i})\,w_{ak}$$

where $0 < \alpha < 1$ is a constant, $q_{ak}$ and $d_{ik}$ are the frequencies of term $k$ in the query and the document, respectively, $L_{a}$ and $L_{i}$ are the query and document lengths, and $S(\cdot)$ is a sigmoid-like function that suppresses outlying values. For example, in the query-focused $\mathrm{RSV}_{query}$, $w_{ak}$ is defined as $\log[r(1-s)/((1-r)s)]$, where $r = \Pr(\text{term } k \text{ present} \mid \text{relevant to } q_{a})$ and $s = \Pr(\text{term } k \text{ present} \mid \lnot\text{relevant to } q_{a})$; this is the log of the likelihood ratio of term $k$, assuming term independence. The weight $S(d_{ik}/L_{i})$ is
a measure of the contribution of the conceptual component term $k$ in document $i$. The starting point of this formula is to evaluate a log-odds value $\log[\Pr(\text{relevant to } q_{a} \mid \text{document } i)/\Pr(\lnot\text{relevant to } q_{a} \mid \text{document } i)]$ for ranking document $i$ with respect to query $a$; this log-odds value is transformed by Bayes' theorem, assuming items are constituted of conceptual term components that are independent, together with some other assumptions (Kwok, 1990). The RSV attached to each document is then sorted to produce a ranked retrieval list for the user who poses the query.
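
To make the formula concrete, here is a small sketch of the RSV combination in Python. It is our illustration: the particular form of $S(\cdot)$ shown is an assumption, not the actual PIRCS function.

```python
import math

def S(x: float) -> float:
    """Placeholder for the sigmoid-like function that suppresses outlying values."""
    return x / (1.0 + x)

def term_weight(r: float, s: float) -> float:
    """w = log[r(1-s) / ((1-r)s)], the log likelihood ratio for one term."""
    return math.log(r * (1.0 - s) / ((1.0 - r) * s))

def rsv(alpha, q_tf, d_tf, w_ik, w_ak, L_a, L_i):
    """RSV = alpha * RSV_doc + (1 - alpha) * RSV_query over common terms k.
    q_tf, d_tf, w_ik, w_ak are dicts keyed by term."""
    common = q_tf.keys() & d_tf.keys()
    rsv_doc = sum(S(q_tf[k] / L_a) * w_ik[k] for k in common)    # probabilistic indexing side
    rsv_query = sum(S(d_tf[k] / L_i) * w_ak[k] for k in common)  # probabilistic retrieval side
    return alpha * rsv_doc + (1.0 - alpha) * rsv_query
```
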
A major difference of our model from other similar probabilistic approaches is that it treats a document or query not as monolithic, but as constituted of conceptual components (which we approximate as terms). This leads us to formulate in a collection of components rather than documents, and allows us to account for the non-binary occurrence of terms in items in a natural way. In the weighting formula for term $k$, $w_{ak} = \log[r(1-s)/((1-r)s)]$ as defined earlier, $r$ is set to a query self-learn value of $q_{ak}/L_{a}$ during initial retrieval, when no relevance information is available except that the query is self-relevant, and $s$ is set to $F_{k}/M$: the collection term frequency of $k$ ($F_{k}$) divided by the total number of terms $M$ used in the collection. We call the factor $\log[(1-s)/s]$ the inverse collection term frequency (ICTF) factor; it differs from the usual inverse document frequency (IDF), which counts using the document frequency of a term and whole documents. Moreover, as the system learns from relevant documents, $r$ can be trained to any value intermediate between the initial self-learn value and that given by the known relevant documents, according to a learning procedure (Kwok, 1995). Our system also deals with long documents and short queries in special ways. Long documents may cover widely different topics, and only a small part of one may address the content of the query under search. If the document is treated as a whole, local probability estimates of terms within the document may not be correct. We handle such documents by segmenting them into approximately equal-length sub-documents. In a retrieval, multiple sub-documents of the same document may rank high. For the final ranking, the retrieval status values (RSVs) of the top three sub-documents of the same document are combined with decreasing weights to return a final RSV. This in effect favors retrieval of longer documents that contain positive evidence in different sub-parts. With short queries, term importance within them cannot be distinguished by weighting; we borrow term statistics from the whole collection to weight these query terms, a scheme called avtf (average term frequency) weighting (Kwok, 1996).
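
As an example of the final-ranking step, merging the top sub-document scores of one document might look like the following sketch; the decreasing weights shown are illustrative placeholders, as the paper does not specify the actual values:

```python
def document_rsv(subdoc_rsvs: list[float], weights=(1.0, 0.5, 0.25)) -> float:
    """Combine the top three sub-document RSVs of one document with decreasing weights."""
    top = sorted(subdoc_rsvs, reverse=True)[:3]
    return sum(w * r for w, r in zip(weights, top))
```
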
The adaptive capability of our network (Kwok, 1995) is materialized as a two-stage ad hoc retrieval strategy. The first stage is an initial retrieval in which the raw user query is employed directly. The d best-ranked sub-documents from this retrieval are then regarded as relevant without user judgment, and employed as feedback data to train the initial query term weights and to add new terms to the query—a process of query expansion. This procedure has been called pseudo-feedback because no user judgment of documents is involved. Retrieval with this modified query then provides the final result. This second-stage retrieval can in general provide substantially better results than the initial one, provided the initial retrieval is reasonable and has some relevant documents within the d best-ranked. The process is like having a dynamic thesaurus bring in synonymous or related terms to enrich the raw query (Evans & Lefferts, 1994). PIRCS brings these methodologies together and provides a powerful retrieval engine for experimentation.
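
The two-stage strategy can be sketched as follows; `index.rank`, `train_weights`, and `expand` are hypothetical helpers standing in for PIRCS's actual machinery, and the values of d and the expansion size are illustrative only:

```python
def two_stage_retrieval(query, index, d=20, n_terms=40):
    """Two-stage ad hoc retrieval with pseudo-feedback (sketch; the helper
    methods and parameter values are assumptions, not PIRCS's actual API)."""
    first_pass = index.rank(query)                    # initial retrieval, ranked sub-documents
    pseudo_relevant = first_pass[:d]                  # top d assumed relevant, no user judgment
    query = index.train_weights(query, pseudo_relevant)           # re-weight original terms
    query = index.expand(query, pseudo_relevant, n_terms=n_terms) # query expansion
    return index.rank(query)                          # final result
```
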

3.2. Statistics of Collections, Queries, and Terms

Our investigation was based on the TREC-5 and TREC-6 Chinese collection of 24,988 XinHua and 139,801 People's Daily news articles, totaling about 170 MB. These are probably the largest evaluated text collections available for Chinese-language IR to date. To guard against very long articles, which may lead to difficulties in inter-document comparisons, they are divided into sub-documents of about 475 characters ending on a paragraph boundary (for bigrams, sub-documents of about 625 characters are preferable). An example document from XinHua is shown in the Appendix. This produces a total of 247,685 sub-documents (218,957 for bigrams), which are then segmented into characters, bigrams, and short-words as described in Section 2. For short-word representation, single characters from each word of length two or greater are also included for indexing purposes, to minimize the problems of wrong segmentation. For bigrams, we show results both with and without mixing in characters.
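
A sketch of this sub-document splitting (our illustration; the policy of cutting at the first paragraph boundary past the target size is an assumption consistent with the description):

```python
def split_subdocs(paragraphs: list[str], target: int = 475) -> list[str]:
    """Split an article into sub-documents of roughly `target` characters,
    always ending on a paragraph boundary."""
    subdocs, current = [], ""
    for p in paragraphs:
        current += p + "\n"
        if len(current) >= target:      # first boundary at or past the target
            subdocs.append(current)
            current = ""
    if current:
        subdocs.append(current)          # trailing remainder
    return subdocs
```
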
Provided with the TREC-5 and TREC-6 experiments are two sets of 28 and 26 very long and rich Chinese topics, mostly on current affairs. For ease of comparison with previous TREC experiments, we continue to perform investigations using these two sets separately; one can always combine their performance measures into one using the ratio 28:26. An example topic, #CH19, is shown in the Appendix. The topics are processed like documents and form queries. These topics, representing user needs, have also been manually judged with respect to the (most fruitful part of the) collection at the National Institute of Standards and Technology, so that a set of relevant documents for each query is known. This allows retrieval results to be evaluated against known answers. Table 1 shows the various statistics generated from the collections and queries after our indexing procedures. It can be seen that bigram-with-character and plain bigram indexing are the most space-consuming. The lengths of queries and documents are obtained after thresholding with frequencies 3 and 20K for all representations except the character representation, which uses 3 and 29K (60K for short queries).

TABLE 1. Statistics generated from the collections and queries.

                                    1-gram (c)    Bigram (bi)   Bigram & 1-gram (bi.c)   Short-Word & 1-gram (sw.c)
INDEX TERMS (raw text)
  Unique types                      8,093*        1,461,551     1,468,623                410,212
  Tokens                            65,209,106    58,463,482    123,338,700              90,209,363
INDEX TERMS (after thresholding)
  Unique types                      5,890         562,969       568,212                  102,835
  Tokens                            21,330,969    42,690,777    50,463,006               23,183,539
DOCUMENTS
  #Sub-docs                         247,685       218,957       218,957                  247,685
  Average doc length                86.1          192.7         230.5                    93.9
QUERIES (unique terms)
  TREC5 (28 queries), long          19.3          74.2          78.25                    35.9
  TREC5 (28 queries), short         7.0           11.1          12.9                     7.0
  TREC6 (26 queries), long          20.2          65.2          80.88                    37.7
  TREC6 (26 queries), short         6.0           10.3          12.2                     6.0

* Includes some English terms.

3.3. Measures of Retrieval Effectiveness

In TREC, it is customary to show the effectiveness of a retrieval using precision tables at different points, averaged over the set of queries under investigation, such as shown in Table 2.
Precision is defined as the proportion of retrieved documents that are relevant, and recall as the proportion of relevant documents that are retrieved. In general, as more documents are retrieved, precision falls while recall increases. Many summarizing measures, shown in Table 2, can be used, such as the number of relevant documents retrieved (Rel-ret), which counts the relevant documents (as judged by NIST manual evaluation) retrieved at rank 1,000 or better; it should be compared with all the known relevant documents, listed in the row labeled Relevants. The 1,000-document threshold is set arbitrarily by NIST for comparison purposes. This would be a suitable measure for people primarily interested…
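
For reference, the two measures and Rel-ret just defined can be computed as in this sketch (our illustration):

```python
def precision_recall(retrieved: list[str], relevant: set[str], cutoff: int = 1000):
    """Precision and recall at a rank cutoff, plus Rel-ret as defined above."""
    top = retrieved[:cutoff]
    rel_ret = sum(1 for doc in top if doc in relevant)   # Rel-ret
    precision = rel_ret / len(top) if top else 0.0
    recall = rel_ret / len(relevant) if relevant else 0.0
    return precision, recall, rel_ret
```
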
