`
`www.elsevier.com/locate/infoproman
`
`Dictionary-based techniques for cross-language
`information retrieval q
`
`Gina-Anne Levow a,*, Douglas W. Oard b, Philip Resnik c
`
`a Department of Computer Science, University of Chicago, 1100 E. 58th Street, Chicago, IL 60637, USA
`b College of Information Studies and Institute for Advanced Computer Studies, University of Maryland,
`College Park, MD 20742, USA
`c Department of Linguistics and Institute for Advanced Computer Studies, University of Maryland,
`College Park, MD 20742, USA
`
`Received 10 June 2004; accepted 14 June 2004
`Available online 19 August 2004
`
`Abstract
`
`Cross-language information retrieval (CLIR) systems allow users to find documents written in different languages
`from that of their query. Simple knowledge structures such as bilingual term lists have proven to be a remarkably useful
`basis for bridging that language gap. A broad array of dictionary-based techniques have demonstrated utility, but com-
`parison across techniques has been difficult because evaluation results often span only a limited range of conditions.
`This article identifies the key issues in dictionary-based CLIR, develops unified frameworks for term selection and term
`translation that help to explain the relationships among existing techniques, and illustrates the effect of those techniques
`using four contrasting languages for systematic experiments with a uniform query translation architecture. Key results
`include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cog-
`nates and development of a query-specific measure for translation fanout that helps to explain the utility of structured
`query methods.
© 2004 Elsevier Ltd. All rights reserved.
`
`Keywords: Cross-language information retrieval; Ranked retrieval; Dictionary-based translation
`
`q This work was supported in part by DARPA contract N6600197 C8540, DARPA cooperative agreement N660010028910, and
`NSF grant EIA0130422.
`* Corresponding author. Tel.: +1 773 702 5680; fax: +1 773 702 8487.
`E-mail addresses: levow@cs.uchicago.edu (G.-A. Levow), oard@glue.umd.edu (D.W. Oard), resnik@umiacs.umd.edu (P. Resnik).
`
0306-4573/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
`doi:10.1016/j.ipm.2004.06.012
`
`AOL Ex. 1027
`Page 1 of 25
`
`
`
`524
`
`G.-A. Levow et al. / Information Processing and Management 41 (2005) 523–547
`
`1. Introduction
`
In the book of Genesis, the following passage describing the impact of linguistic diversity on mankind's
`ability to create great works (in this case, the Tower of Babel) seems particularly apt to the situation we
`observe on the Internet today:
`
`‘‘Behold, they are one people, and they have all one language; and this is only the beginning of what
`they will do; and nothing that they propose to do will now be impossible for them. Come, let us go
down, and there confuse their language, that they may not understand one another's speech.''
`
`Of course, many linguists might dispute this explanation for the diversity evident in human language.
`Whatever the cause, overcoming the language barrier has been the focus of great interest and substantial
`investment since the dawn of the computer age. Early efforts proved to be disappointing, in part because
`the theory, techniques and resources available at the time were not sufficient to automatically produce flu-
`ent translation of unrestricted text (ALP, 1966). The situation has improved somewhat in recent years, as
`new techniques have been developed (Brown et al., 1990 and successors) and because the emergence of the
`World Wide Web has provided a strong forcing function. Much of the present focus of application devel-
`opment has been characterized by Church and Hovy as seeking ‘‘good applications for crummy machine
`translation’’ (Church & Hovy, 1993). Among the uses that have been found, few have been as successful
`as cross-language information retrieval (CLIR).
`The goal of a CLIR system is to help searchers find documents that are written in languages that are
`different from the language in which their query is expressed. 1 This can be done by constructing a mapping
`between the query and document languages, or by mapping both the query and document representations
`into some third feature space. The first approach is often referred to as ‘‘query translation’’ if done at query
`time and as ‘‘document translation’’ if done at indexing time, but in practice both approaches require that
`document-language evidence be used to compute query-language term weights that can then be combined
`as if the documents had been written in the query language.
`In all cases, however, a key element is the mechanism to map between languages. This translation knowl-
`edge can be encoded in different forms––as a data structure of query and document-language term corre-
`spondences in a machine-readable dictionary or as an algorithm, such as a machine translation or
`machine transliteration system. While all of these forms are effective, the latter require substantial investment
`in time and resources for development and thus may not be widely or readily available for many language
`pairs. Therefore, we focus in this article on the machine-readable dictionary in its simplest form, a bilingual
`term list. Because of its simplicity, such pairwise lists of translation correspondences are readily available for
`many language pairs and are relatively easy to construct if unavailable. We identify techniques that allow the
`CLIR system to best exploit these simple resources and concurrently identify general issues and approaches
`that have bearing on CLIR techniques that employ more complex encodings of translation knowledge.
`In this article, we draw together a body of work that has not previously been accessible in a single source
`in a way that makes three key contributions. First, we present a holistic view of issues that have previously
`been presented only in isolation, and often in different communities (e.g., information retrieval and com-
`putational linguistics). Second, we introduce a unified framework based on mapping evidence about mean-
`ing across languages, casting widely used techniques such as balanced and structured translation in that
`framework as a way of illustrating their relative strengths. And third, we present a comprehensive set of
`
`1 We use the term ‘‘document’’ broadly here to mean any linguistic expression, whether stored as character code in a computer,
`printed on paper, or spoken. For ease of presentation, we assume that documents in other forms are converted into character codes
`using optical character recognition or speech recognition prior to indexing, and do not treat those details further in this article.
`
`
`contrastive experiments to illustrate the effects of each technique that we describe, together with new in-
`sights based on those results.
`Our goal in this article is not merely to describe the state of the art, but to illustrate the effect of the tech-
`niques that we describe on retrieval effectiveness for languages with different characteristics. This naturally
`leads to the question of what system architecture to choose in order to make informative comparisons and
`what measures to use to make those comparisons. We have chosen a query translation architecture that
`illustrates the full range of opportunities to improve retrieval effectiveness. Furthermore, as a practical con-
`sideration, repeated trials with alternate query translation techniques are more easily run than those with
`alternate document translation techniques. But our goal in this paper is to present a framework for consid-
`ering the fundamental issues in dictionary-based CLIR, and those issues will naturally be important con-
`siderations in the design of any dictionary-based CLIR system, regardless of the specific architecture
`adopted. In our experiments, we demonstrate the impact of various techniques on retrieval effectiveness
`using standard large-scale test collections for languages exhibiting a range of interesting linguistic pheno-
`mena. Specifically, we perform experiments using English language queries with document collections in
`French and Mandarin Chinese for all experimental conditions, 2 and German and Arabic to illustrate some
`specialized processing.
`Fig. 1 illustrates the data flow between the key components in our reference architecture. Our dictionary-
`based query translation architecture consists of two streams of processing, for the query and documents.
`Close observation will reveal substantial parallelism in the processing of these two streams, as well as sym-
`metry in pre- and post-translation query processing. Specifically, we exploit methods for suitable term
`extraction and pseudo-relevance feedback expansion at three main points in the retrieval architecture:
`before document indexing, before query translation, and after query translation. The discussion and exper-
`iments throughout the paper highlight both similarities in the techniques employed at these different stages
`of processing and differences in the goals and optimization criteria necessary at each stage. Different targets
`for matching––in the dictionary for pre-translation processing and between the document and translated
`queries at the other points––influence the specific strategies used as well as the effectiveness of the tech-
niques. The translation process bridges the language gap, and the information retrieval system finally per-
forms the actual query-to-document match, producing a ranked list of documents. Comparison of this
`ranked list to relevance judgments yields our experimental figure of merit.
`The remainder of this article is organized as follows. 3 Section 2 describes in some detail document
`processing for a dictionary-based cross-language setting, introducing the specific techniques used in our
`experiments. Sections 3 and 4 through 4.2 provide a similar level of detail for query processing and for
`the use of translation knowledge to map between the document and query languages. Experiment results
`that illustrate the effects of specific techniques appear in Section 5.
`
`2. Document processing
`
`Having discussed our CLIR reference architecture in general terms, we proceed in this section to a dis-
`cussion of document processing that considers the issues and the alternative methods in greater detail, while
`also introducing the specific methods used in our later experiments.
`In this section we focus on two critical elements in document processing, index term extraction and doc-
`ument expansion, with an emphasis on issues relevant in a cross-language setting. These two steps can be
`
`2 We applied document expansion only to French.
`3 Owing to space limitations, we assume that the reader is familiar with the central issues and techniques for ranked retrieval in
`monolingual applications. See Frakes and Baeza-Yates (1992) for relevant background.
`
`
`Fig. 1. CLIR architecture. Letters are keyed to section numbers where components are discussed. Bolding indicates key contributions
`of this paper.
`
`viewed as components in constructing the representation of document content that will be used in retrieval.
`In the first step, a document is characterized by the set of terms that appear within––though, as we shall see,
`we may wish to interpret appear somewhat indirectly. An English document about crude oil pipelines in
`Afghanistan can be characterized by terms like oil, pipelines, and Afghanistan; one might also wish to
`include in the characterization words like pipeline and pipe that appear implicitly, the better to match terms
`that could appear in queries seeking documents like this one.
`
`
`In the second step, one takes this idea of implicitly represented terms further. If the document contains
`terms like oil, pipelines, and Afghanistan, the concepts underlying the document probably also involve terms
`like petroleum, gas, capacity, and kilometer. Document expansion is the process of adding such terms to the
`document representation, thereby making explicit those terms that are sufficiently related to the document
`in a conceptual sense.
`
`2.1. Extraction of indexing terms
`
`Index term extraction is relatively straightforward in English, but freely compounding languages such as
`German and unsegmented languages such as Chinese pose additional challenges. As is well known, the eas-
`iest way to extract indexing terms from a document, recognizing tokens separated by white-space, is often
`too simple. Here we describe a range of techniques that provide better results given the challenges presented
`by a representative range of languages including English, French, Arabic, Chinese, and German. These
approaches fall into two main categories: (1) automatically segmenting the text stream into a single sequence
of non-overlapping words, which might then be subjected to further processing such as stemming, and (2)
indexing overlapping character sequences.
`
`2.1.1. English and French: tokenization, clitic splitting, and stemming
`English and French are written with generally space-delimited words, and simple pattern-based ap-
`proaches to tokenization work well for separating words from punctuation. In both English and French,
`clitic splitting is employed to separate morphemes connected to a word by an apostrophe. Techniques typ-
ically consist of a twofold process of separating the clitic and then expanding it, e.g., m'aidez → m'
aidez → me aidez.
`After tokenization as above, the resulting words are often normalized via either morphological analysis
`(e.g., Koskenniemi, 1983), mapping inflected verbs such as aidez to root forms such as aider, or, more com-
`monly, by the process of stemming, which typically involves application of a set of rules for removal of pre-
`fixes or suffixes or both. The widely used rule-based approach to stemming pioneered by Porter (1980)
stems continua, continuer, continuait, continuera, continuant, continuerait, continuation, continueront, con-
tinue, continuez, continué, and continuité to the single stem continu.
`Notice that unlike morphological normalization, which usually preserves part-of-speech distinctions,
stemming freely collapses across parts of speech, e.g., noun continuité and verb continuez reduce to the same
`stem, and may freely produce terms that are not actually words in the language. This increases the likeli-
`hood of a match when normalized document and query representations are used.
`In our experiments on French documents, we first applied a two-stage clitic separation and expansion
`approach. We then applied a rule-based Porter-style stemmer, freely available from http://xapian.org, to
`normalize across morphological variants.
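As an illustration, the two-stage clitic handling and suffix stripping described above can be sketched as follows. The clitic inventory and suffix list are toy fragments chosen for exposition only; they are not the rules of the Xapian stemmer used in our experiments.

```python
import re

# Hypothetical clitic inventory (a real system would cover more forms).
CLITIC_EXPANSIONS = {"m'": "me", "t'": "te", "s'": "se", "l'": "le",
                     "j'": "je", "n'": "ne", "d'": "de", "qu'": "que"}

def split_clitics(token):
    """Two-stage clitic handling: separate the clitic, then expand it."""
    match = re.match(r"^(qu'|[jmtsldn]')(.+)$", token, re.IGNORECASE)
    if match:
        clitic, rest = match.groups()
        return [CLITIC_EXPANSIONS[clitic.lower()], rest]
    return [token]

# A toy suffix stripper in the spirit of Porter (1980); real stemmers apply
# ordered rule sets with conditions on the remaining stem.
SUFFIXES = ["eraient", "erions", "eront", "erait", "aient",
            "ions", "ait", "ant", "ent", "era", "ez", "er", "ité"]

def toy_stem(word):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

print(split_clitics("m'aidez"))   # ['me', 'aidez']
print(toy_stem("continuait"))     # continu
print(toy_stem("continuez"))      # continu
```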
`
`2.1.2. Arabic: complex morphology
`The morphology of Arabic is far more complex than that of English or French. Adopting a generative
`view of Arabic morphology, template-based character insertions are used to convert generalized ‘‘roots’’
`(such as ‘‘ktb’’ [in the standard transliteration], which serves as the base form for many words that have
`to do with writing) into more specific ‘‘stems’’ (such as ‘‘ktAb,’’ which means ‘‘book’’). Prefixes and suffixes
`can then be added to these stems to form words, and some common modifiers can be adjoined to the begin-
`ning or end of a word to form a limited class of compound forms (e.g., ‘‘wktAbAn’’, ‘‘and two books’’). It is
`often the case that several roots could be used to generate the same Arabic token, so token-level analysis is
`often highly ambiguous. State of the art techniques such as two-level finite-state morphology (Beesley,
`1998) therefore typically generate several possible, but sometimes highly improbable, analyses. Three
`approaches to this challenge are possible. The first is to do a full analysis and then select the most probable
`
`
`results based on corpus statistics and/or context (e.g., Darwish, 2002). A widely used alternative is to in-
`stead apply rule-based techniques to remove common prefixes and suffixes (whether from morphology
`or compounding) to produce something akin to English or French stemming. This approach is typically
`referred to as ‘‘light stemming’’ in Arabic, since the resulting ‘‘stems’’ sometimes differ from what would
`be produced by a full linguistic analysis (Aljlayl & Frieder, 2002). A third approach is corpus-based clus-
`tering, in which terms found in some other way (e.g., through light stemming) are grouped into classes
`based on their distributional characteristics (e.g., De Roeck & Al-Fares, 2000). We used the first two
`approaches for the illustrative experiments described in this article that involve Arabic (Darwish, 2002).
`In our experiments we formed four kinds of terms:
`
`• Token, in which only white-space was stripped.
`• Linguistic stems, in which affixes were stripped using the most likely analysis from the Sebawai morpho-
`logical analyzer (Darwish, 2002).
• Linguistic roots, in which the most likely Sebawai analysis was used to identify the root (e.g.,
alkitab → ktb).
• Lightly stemmed words, in which affixes were automatically stripped using a simple rule-based system
(Al-stem) (e.g., alkitab → kitab).
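The flavor of light stemming can be conveyed with a toy stemmer over Latin-transliterated Arabic; the actual Al-stem rules operate on Arabic script with a larger affix inventory, so the lists below are illustrative only.

```python
# Hypothetical affix lists for a toy light stemmer (transliterated Arabic).
PREFIXES = ["wal", "bal", "kal", "fal", "al", "wa", "w"]
SUFFIXES = ["At", "An", "yn", "wn", "h", "p"]

def light_stem(word):
    for prefix in PREFIXES:              # strip at most one leading affix
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:              # strip at most one trailing affix
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word

print(light_stem("alkitab"))    # kitab
print(light_stem("wktAbAn"))    # ktAb
```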
`
`2.1.3. Chinese: word segmentation
`Spoken languages generally lack any explicit marking of the breaks between words, and some written
`languages exhibit similar characteristics, lacking between-word spaces or other indications of word bound-
`aries. Two approaches to term extraction are possible in such cases: (1) automatic segmentation, and (2)
`overlapping character n-grams. We have chosen Chinese to illustrate these approaches.
`Automatic segmentation techniques typically model the task as selecting a partition on the sequence of
`characters that corresponds to word boundary positions (although variants that include aspects of stem-
`ming or expansion of contractions have also been explored). A wide variety of techniques have been devel-
`oped, but all can be cast in a framework of model-based optimization. The simplest example is longest
`substring matching, in which a sentence is traversed from left to right, removing the longest dictionary term
`that begins at the present position (or a single character, if no dictionary term is found). This corresponds
`to minimizing the number of characters that are not covered by a term found in the dictionary using a gree-
`dy search strategy. The key ideas in this framework are the function to be optimized (in this case, a function
`of the chosen partition) and the search strategy to be used to explore the space of possible partitions.
`Other optimization functions incorporate the degree of fit to hand-segmented training data (for super-
`vised techniques) (Emerson, 2001) or measures of consistency such as minimum description length (for
`unsupervised techniques). Greedy search strategies are widely used, but dynamic programming (Barras,
`Geoffrois, Wu, & Liberman, 1998) or exhaustive enumeration (Jin, 1992) are also sometimes employed.
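In the optimization framework just described, longest-substring matching pairs the simplest objective (minimizing uncovered characters) with a greedy left-to-right search, which can be sketched as:

```python
def greedy_segment(text, dictionary, max_len=4):
    """Longest-match segmentation: traverse left to right, removing the
    longest dictionary term at the current position, backing off to a
    single character when no dictionary term matches."""
    terms, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                terms.append(candidate)
                i += length
                break
    return terms

# Uppercase letters stand in for Chinese characters in this toy lexicon.
lexicon = {"AB", "ABC", "DE"}
print(greedy_segment("ABCDEF", lexicon))   # ['ABC', 'DE', 'F']
```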
`The difficulty of accurate word segmentation for Chinese has led to extensive use of overlapping char-
`acter n-grams for indexing Chinese. The idea is to eschew the notion of a definitive segmentation altogether,
`and instead generate all the character n-grams of a fixed width observed in the text. This abandons any sem-
`blance of interpretability for the extracted terms, but it does have the advantage of producing term repre-
`sentations that support good matches when they exist, as well as permitting partial matches.
`To illustrate with an English word, if a query contains the word china, then the trigram terms generated
`from the query will provide matches against documents containing occurrences not only of china, but also
`chinese and indochina, since trigrams chi and hin are shared. This provides the effect of stemming without
`the necessity of identifying the token boundaries.
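Generating overlapping n-grams requires only a sliding window; the china/indochina overlap above falls out directly:

```python
def char_ngrams(text, n=3):
    """All overlapping character n-grams of fixed width n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

shared = set(char_ngrams("china")) & set(char_ngrams("indochina"))
print(sorted(shared))   # ['chi', 'hin', 'ina']
```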
`The same approach can also be used for languages with white-space delimited tokens, of course. In prac-
`tice, the choice of n-gram width varies by language, and tends to correlate with the average size of a mor-
`
`
`phological unit––hence English is best represented using n-grams for n in the vicinity of 5 (Mayfield &
`McNamee, 1999), and Chinese is best represented with n of 2 (Meng et al., 2001; Wilkinson, 1997). 4
`For the illustrative experiments in this article that involve Chinese, we have experimented with both heu-
`ristic longest-match segmentation using the NMSU segmenter (Jin, 1992) and with terms based on Chinese
`character bigrams (Section 5.4).
`
`2.1.4. German: decompounding
`Although German uses white-space to separate words, its well known productivity with respect to com-
`pound words raises issues of within-word segmentation similar to those of Chinese. German decompound-
`ing can therefore be viewed in the same optimization and search framework. We apply a dictionary-based,
`greedy approach in our experiments, using the German side of our German–English bilingual term list as
`the segmentation dictionary. Morphological normalization for German terms––either pre- or post-decom-
`pounding––can be addressed using the same sorts of normalization approaches discussed above for English
`and French.
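Our greedy dictionary-based decompounding can be sketched as a recursive longest-prefix cover; the three-entry lexicon below is a hypothetical stand-in for the German side of our bilingual term list.

```python
def decompound(word, lexicon, min_len=3):
    """Greedy decompounding: take the longest lexicon entry that prefixes
    the word, recurse on the remainder, and report None when the word
    cannot be fully covered (a caller then keeps the word whole)."""
    if word in lexicon:
        return [word]
    for split in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:split], word[split:]
        if head in lexicon:
            rest = decompound(tail, lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None

# Hypothetical lexicon entries for illustration.
lexicon = {"wasser", "kraft", "werk"}
parts = decompound("wasserkraftwerk", lexicon) or ["wasserkraftwerk"]
print(parts)   # ['wasser', 'kraft', 'werk']
```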
`
`2.2. Document expansion
`
`A quintessential problem in information retrieval is the fact that the same underlying ideas can be rep-
`resented in many different ways in the observed text. One approach to this problem is to represent under-
`lying concepts as hidden variables in a probabilistic model (Kraaij & Hiemstra, 1998; Ponte & Croft, 1997).
`Another is to treat term-based representations as incomplete, and to expand them to include terms repre-
`sentative of the underlying concepts that cannot be extracted explicitly from the text itself. This is the basis
`for widely used query expansion techniques in monolingual information retrieval.
`While query expansion is a well-established technique for both monolingual and cross-language infor-
`mation retrieval, document expansion has only recently been applied to these tasks. The document expan-
`sion approach was first proposed by Singhal and Pereira (1999) in the context of spoken document
`retrieval. Since spoken document retrieval involves search of error-prone automatic speech recognition
`transcriptions, Singhal et al. introduced document expansion as a way of recovering those words that might
`have been in the original broadcast but that had been misrecognized. Their results showed that correctly
`recognized terms yield a topically coherent transcript, while the errors tend not to co-occur in comparable
`documents. Using the document as a query to a comparable collection typically yields documents that con-
`tain some related terms that are highly selective; when those terms are added to the document, improved
`retrieval effectiveness was observed.
`The same idea can be applied in CLIR to find words that the author might have used; this can achieve an
`effect similar to post-translation query expansion. We expanded the original news stories with the most
`selective terms from related documents. First all documents underwent basic term extraction as described
`above. Then each document was reformatted as a query in which all terms were weighted equally. We used
`the full document collection as a comparable collection to be searched for enriching terms. We then selected
`the top five ranked documents (excluding the original document itself) as sources of expansion terms. Next
`we chose highly selective terms from these documents by ranking terms in decreasing order based on inverse
`document frequency. We added an instance of each term to the original document up to one less than the
`number of expansion documents in which it appeared, until we had approximately doubled the original
`
`4 Chinese characters carry semantic content in ways English characters do not, so on average words are approximately two
`characters long and individual characters often carry some recognizable component of meaning.
`
`
`document length. 5 This process sought to maintain the fidelity of the term frequency component for term
`weighting. Finally, we indexed the resulting expanded documents.
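The document expansion procedure can be sketched as follows. Here `search`, `fake_search`, and the collection statistics are hypothetical stand-ins for the retrieval engine and comparable collection, not the implementation used in our experiments.

```python
import math
from collections import Counter

def document_expansion(doc_terms, search, collection_df, n_docs, k=5):
    """Expand a document with selective terms from comparable documents.

    `search(terms, k)` is a hypothetical hook returning the k top-ranked
    (doc_id, terms) pairs, with the original document already excluded."""
    budget = len(doc_terms)                  # add terms until length ~doubles
    doc_count = Counter()                    # expansion docs containing each term
    for _, terms in search(doc_terms, k):    # document text used as the query
        doc_count.update(set(terms))
    # Most selective candidates first: rank by inverse document frequency.
    ranked = sorted(doc_count,
                    key=lambda t: math.log(n_docs / collection_df.get(t, 1)),
                    reverse=True)
    expanded = list(doc_terms)
    for term in ranked:
        # One less than the number of expansion documents containing the
        # term, capped by the remaining length budget.
        copies = min(doc_count[term] - 1, budget)
        expanded.extend([term] * copies)
        budget -= copies
        if budget <= 0:
            break
    return expanded

# Hypothetical comparable collection statistics and search results.
collection_df = {"oil": 50, "petroleum": 2, "the": 900}

def fake_search(query, k):
    return [("d1", ["petroleum", "oil", "the"]),
            ("d2", ["petroleum", "the"]),
            ("d3", ["petroleum"])]

expanded = document_expansion(["oil", "pipeline"], fake_search,
                              collection_df, n_docs=1000, k=3)
print(expanded)   # ['oil', 'pipeline', 'petroleum', 'petroleum']
```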
`
`3. Query processing
`
`As is the case for document processing, the processing of queries in dictionary-based CLIR depends on
`extraction of query terms in order to represent the information need, perhaps with query expansion to aug-
`ment that representation with additional related terms. In this section we elaborate on the details of pre-
`translation query processing; translation issues are then taken up in Section 4.
`
`3.1. Pre-translation term extraction
`
`In Section 2.1, we focused on the extraction of terms from documents in order to provide term-based
`representations for retrieval. The process of term extraction from queries involves essentially the same lin-
`guistic issues and utilizes many of the same techniques. There is, however, a key difference when extracting
`terms from queries in a query-translation architecture: the terms extracted from queries are going to form
`the basis for translation, not for matching.
`Matching techniques (in monolingual applications) typically seek to enhance recall by conflating differ-
`ences using stemming and segmentation processes, at the expense of some precision. In contrast, for dic-
`tionary-based CLIR, a key concern is mitigating the effects of ambiguity, where multiple terms with
`multiple senses result in an explosion of translation alternatives. We focus on matching the dictionary at
`the highest level of selectivity to minimize ambiguity, and then applying backoff strategies to enhance cov-
`erage only when exact match translations are unavailable.
`As an additional tactic to reduce ambiguity, we take advantage of the observation that multi-word
`expressions rarely have more than one interpretation (e.g., the word ‘‘house’’ in ‘‘White House’’ cannot
`be translated in the verb sense meaning ‘‘to shelter’’); translating multi-word expressions as a unit is well
`known to be helpful for CLIR (Ballesteros & Croft, 1997). Since the task of identifying useful multi-word
expressions can be viewed as a variant of the segmentation optimization problem, we applied the greedy long-
`est-match technique described in Section 2.1.
`
`3.2. Pre-translation expansion
`
`Query expansion is a well-established technique in monolingual information retrieval. Very short (2–3
`word) queries are common in some applications (e.g., Web search). Expansion can help to compensate
`for this kind of incomplete specification of the information need. Brevity can yield ambiguity (reducing pre-
`cision), or may result in omission of terms that are used by authors of the documents that are sought
(reducing recall). Query expansion using pseudo-relevance feedback has been shown to partially overcome
`these difficulties (Buckley, Salton, Allan, & Singhal, 1994). While expansion in general increases the mean
`retrieval effectiveness, it may also concurrently increase variance across queries.
`Ballesteros and Croft (1997) evaluated pre- and post-translation query expansion in a Spanish–English
`cross-language information retrieval task and found that combining pre- and post-translation query expan-
`sion improved both precision and recall, with pre-translation expansion improving both precision and re-
call, and post-translation expansion enhancing precision. McNamee and Mayfield's ablation experiments
`
`5 More efficient implementations of the document expansion procedure are possible, but this simple approach suffices to illustrate
`the term selection process and its contribution.
`
`
`on the effect of translation resource size on pre- and post-translation query expansion effectiveness demon-
`strated the dominant role of pre-translation expansion in providing translatable terms (McNamee & May-
`field, 2002). If too few terms are translated, post-translation expansion can provide little improvement. In
`pre-translation query expansion, our goal is both that of monolingual query expansion––providing addi-
`tional terms to refine the query and to enhance the probability of matching the terminology chosen by
`the authors of the document––and providing additional terms to limit the possibility of failing to translate
`a concept in the query simply because the particular term is not present in the translation lexicon.
`We performed the expansion as follows. We constructed the initial query in the normal manner for
INQUERY (Callan, Croft, & Harding, 1992). We then used INQUERY's relevance feedback process to
`obtain expansion terms based on the 10 highest ranked retrieved documents from the contemporaneous
`1994 Los Angeles Times documents, part of the Cross-Language Evaluation Forum (CLEF) (Peters,
`2001) 2000 corpus. Experiments with three, five, and ten expansion documents were conducted with min-
`imal difference in resulting retrieval effectiveness. We concatenated the expansion term set to the original
`query and used the resulting query as the basis for translation. 6
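The pre-translation expansion step can be sketched as follows. The `search` hook and feedback documents are hypothetical stand-ins, and candidate terms are ranked by feedback-set document frequency for simplicity; INQUERY's actual feedback scoring is more elaborate.

```python
from collections import Counter

def pre_translation_expand(query_terms, search, top_k=10, n_expansion=20):
    """Pseudo-relevance feedback before translation (simplified sketch).

    `search(terms, k)` is a hypothetical hook returning the term lists of
    the k top-ranked documents from a query-language collection."""
    counts = Counter()
    for doc_terms in search(query_terms, top_k):
        counts.update(set(doc_terms))        # feedback-set document frequency
    original = set(query_terms)
    candidates = [t for t, _ in counts.most_common() if t not in original]
    # Concatenate expansion terms to the original query, uniformly weighted.
    return list(query_terms) + candidates[:n_expansion]

def fake_search(query, k):    # hypothetical feedback documents
    return [["petroleum", "oil"], ["petroleum", "afghanistan"], ["petroleum"]]

expanded_query = pre_translation_expand(["oil", "pipeline"], fake_search,
                                        top_k=3, n_expansion=2)
print(expanded_query)   # ['oil', 'pipeline', 'petroleum', 'afghanistan']
```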
`
`4. Translation knowledge and query translation
`
For CLIR, translation knowledge provides the crucial bridge between the user's information need
expressed in one language and document concepts expressed in the document language. While approaches
`using off-the-shelf machine translation systems alone or in combination with other translation resources
`have been shown to be effective for CLIR tasks (Gey, Jiang, Chen, & Larson, 1998), they are limited to
`the relatively small number of language pairs for which such systems exist. Since our goal is to focus on
`broadly applicable techniques, we focus on the simplest form of a translation lexicon, bilingual term lists,
`which are already available for many language pairs and can be constructed relatively easily for others. A
`bilingual term list is an unordered set of query-language/document-language term translation pairs, often
`with no translation preference or part-of-speech information. In this section, we first describe the bilingual
`term lists that we used in the CLIR experiments reported below. We next describe a general methodology
`for backoff translation that enhances dictionary coverage. We then describe two main strategies for inte-
`grating translation evidence and managing ambiguity through different term weighting techniques. We con-
`clude by presenting two methods to further enhance matching of the translated query with the document
`index through term extraction and expansion processes.
`
`4.1. Bilingual term lists and optimizing coverage
`
`Bilingual term lists are easily found on the Web, often having been created initially for use with simple
`online bilingual dictionary programs. However, since these term lists were constructed for diverse purposes
`and may derive from diverse sources, their ready availability is a great advantage but their possibly eclectic
`structure also presents challenges. They may vary dramatically along several dimensions, including number
`of entries, source, number of multi-word entries, degree of ambiguity, and mix of surface or root form en-
`tries. A key challenge for dictionary-based CLIR systems is to develop techniques to most fully exploit
`these resources while minimizing any negative impact of their more problematic characteristics. A charac-
`terization of the bilingual term lists used in our experiments appears in Table 1. In the cases of English
`
`6 In general, query expansion should involve term reweighting as well, both increasing and decreasing; here we adopt the simple
`strategy of uniform weighting of original and expansion terms.
`
`
Table 1
Characterization of bilingual term lists

Translation resource   # English terms   # Document-language terms   Source
English–French          20,100            35,008                     http://www.freedict.com
English–Arabic         137,235           179,152                     Web word translations
English–Chinese        199,444           395,216                     CETA + http://www.ldc.upenn.edu
English–German          99,357           131,273                     http://www.quickdic.de
`
`paired with French, Chinese, 7 and German, we used existing static bilingual dictionary resources. Lacking
`an English–Arabic resourc