Information Processing and Management 41 (2005) 523–547

www.elsevier.com/locate/infoproman

Dictionary-based techniques for cross-language information retrieval

Gina-Anne Levow a,*, Douglas W. Oard b, Philip Resnik c

a Department of Computer Science, University of Chicago, 1100 E. 58th Street, Chicago, IL 60637, USA
b College of Information Studies and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
c Department of Linguistics and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA

Received 10 June 2004; accepted 14 June 2004
Available online 19 August 2004
Abstract

Cross-language information retrieval (CLIR) systems allow users to find documents written in languages different from that of their query. Simple knowledge structures such as bilingual term lists have proven to be a remarkably useful basis for bridging that language gap. A broad array of dictionary-based techniques have demonstrated utility, but comparison across techniques has been difficult because evaluation results often span only a limited range of conditions. This article identifies the key issues in dictionary-based CLIR, develops unified frameworks for term selection and term translation that help to explain the relationships among existing techniques, and illustrates the effect of those techniques using four contrasting languages for systematic experiments with a uniform query translation architecture. Key results include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cognates, and development of a query-specific measure for translation fanout that helps to explain the utility of structured query methods.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Cross-language information retrieval; Ranked retrieval; Dictionary-based translation
This work was supported in part by DARPA contract N6600197 C8540, DARPA cooperative agreement N660010028910, and NSF grant EIA0130422.
* Corresponding author. Tel.: +1 773 702 5680; fax: +1 773 702 8487.
E-mail addresses: levow@cs.uchicago.edu (G.-A. Levow), oard@glue.umd.edu (D.W. Oard), resnik@umiacs.umd.edu (P. Resnik).

0306-4573/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ipm.2004.06.012
AOL Ex. 1027, Page 1 of 25
1. Introduction

In the book of Genesis, the following passage describing the impact of linguistic diversity on mankind's ability to create great works (in this case, the Tower of Babel) seems particularly apt to the situation we observe on the Internet today:

"Behold, they are one people, and they have all one language; and this is only the beginning of what they will do; and nothing that they propose to do will now be impossible for them. Come, let us go down, and there confuse their language, that they may not understand one another's speech."

Of course, many linguists might dispute this explanation for the diversity evident in human language. Whatever the cause, overcoming the language barrier has been the focus of great interest and substantial investment since the dawn of the computer age. Early efforts proved to be disappointing, in part because the theory, techniques and resources available at the time were not sufficient to automatically produce fluent translation of unrestricted text (ALP, 1966). The situation has improved somewhat in recent years, as new techniques have been developed (Brown et al., 1990 and successors) and because the emergence of the World Wide Web has provided a strong forcing function. Much of the present focus of application development has been characterized by Church and Hovy as seeking "good applications for crummy machine translation" (Church & Hovy, 1993). Among the uses that have been found, few have been as successful as cross-language information retrieval (CLIR).

The goal of a CLIR system is to help searchers find documents that are written in languages different from the language in which their query is expressed. 1 This can be done by constructing a mapping between the query and document languages, or by mapping both the query and document representations into some third feature space. The first approach is often referred to as "query translation" if done at query time and as "document translation" if done at indexing time, but in practice both approaches require that document-language evidence be used to compute query-language term weights that can then be combined as if the documents had been written in the query language.

In all cases, however, a key element is the mechanism that maps between languages. This translation knowledge can be encoded in different forms: as a data structure of query- and document-language term correspondences in a machine-readable dictionary, or as an algorithm, such as a machine translation or machine transliteration system. While all of these forms are effective, the latter require substantial investment in time and resources for development and thus may not be widely or readily available for many language pairs. Therefore, we focus in this article on the machine-readable dictionary in its simplest form, a bilingual term list. Because of its simplicity, such pairwise lists of translation correspondences are readily available for many language pairs and are relatively easy to construct if unavailable. We identify techniques that allow the CLIR system to best exploit these simple resources, and concurrently identify general issues and approaches that have bearing on CLIR techniques that employ more complex encodings of translation knowledge.

In this article, we draw together a body of work that has not previously been accessible in a single source, in a way that makes three key contributions. First, we present a holistic view of issues that have previously been presented only in isolation, and often in different communities (e.g., information retrieval and computational linguistics). Second, we introduce a unified framework based on mapping evidence about meaning across languages, casting widely used techniques such as balanced and structured translation in that framework as a way of illustrating their relative strengths. And third, we present a comprehensive set of contrastive experiments to illustrate the effects of each technique that we describe, together with new insights based on those results.

Our goal in this article is not merely to describe the state of the art, but to illustrate the effect of the techniques that we describe on retrieval effectiveness for languages with different characteristics. This naturally leads to the question of what system architecture to choose in order to make informative comparisons, and what measures to use to make those comparisons. We have chosen a query translation architecture that illustrates the full range of opportunities to improve retrieval effectiveness. Furthermore, as a practical consideration, repeated trials with alternate query translation techniques are more easily run than those with alternate document translation techniques. But our goal in this paper is to present a framework for considering the fundamental issues in dictionary-based CLIR, and those issues will naturally be important considerations in the design of any dictionary-based CLIR system, regardless of the specific architecture adopted. In our experiments, we demonstrate the impact of various techniques on retrieval effectiveness using standard large-scale test collections for languages exhibiting a range of interesting linguistic phenomena. Specifically, we perform experiments using English language queries with document collections in French and Mandarin Chinese for all experimental conditions, 2 and German and Arabic to illustrate some specialized processing.

Fig. 1 illustrates the data flow between the key components in our reference architecture. Our dictionary-based query translation architecture consists of two streams of processing, for the query and the documents. Close observation will reveal substantial parallelism in the processing of these two streams, as well as symmetry in pre- and post-translation query processing. Specifically, we exploit methods for suitable term extraction and pseudo-relevance feedback expansion at three main points in the retrieval architecture: before document indexing, before query translation, and after query translation. The discussion and experiments throughout the paper highlight both similarities in the techniques employed at these different stages of processing and differences in the goals and optimization criteria necessary at each stage. Different targets for matching (in the dictionary for pre-translation processing, and between the documents and translated queries at the other points) influence the specific strategies used as well as the effectiveness of the techniques. The translation process bridges the language gap, and the information retrieval system finally performs the actual query-to-document match, producing a ranked list of documents. Comparison of this ranked list to relevance judgments yields our experimental figure of merit.

The remainder of this article is organized as follows. 3 Section 2 describes in some detail document processing for a dictionary-based cross-language setting, introducing the specific techniques used in our experiments. Sections 3 and 4 through 4.2 provide a similar level of detail for query processing and for the use of translation knowledge to map between the document and query languages. Experiment results that illustrate the effects of specific techniques appear in Section 5.

1 We use the term "document" broadly here to mean any linguistic expression, whether stored as character code in a computer, printed on paper, or spoken. For ease of presentation, we assume that documents in other forms are converted into character codes using optical character recognition or speech recognition prior to indexing, and do not treat those details further in this article.
2. Document processing

Having discussed our CLIR reference architecture in general terms, we proceed in this section to a discussion of document processing that considers the issues and the alternative methods in greater detail, while also introducing the specific methods used in our later experiments.

In this section we focus on two critical elements in document processing, index term extraction and document expansion, with an emphasis on issues relevant in a cross-language setting. These two steps can be viewed as components in constructing the representation of document content that will be used in retrieval. In the first step, a document is characterized by the set of terms that appear within it, though as we shall see, we may wish to interpret appear somewhat indirectly. An English document about crude oil pipelines in Afghanistan can be characterized by terms like oil, pipelines, and Afghanistan; one might also wish to include in the characterization words like pipeline and pipe that appear implicitly, the better to match terms that could appear in queries seeking documents like this one.

Fig. 1. CLIR architecture. Letters are keyed to section numbers where components are discussed. Bolding indicates key contributions of this paper.

2 We applied document expansion only to French.
3 Owing to space limitations, we assume that the reader is familiar with the central issues and techniques for ranked retrieval in monolingual applications. See Frakes and Baeza-Yates (1992) for relevant background.
In the second step, one takes this idea of implicitly represented terms further. If the document contains terms like oil, pipelines, and Afghanistan, the concepts underlying the document probably also involve terms like petroleum, gas, capacity, and kilometer. Document expansion is the process of adding such terms to the document representation, thereby making explicit those terms that are sufficiently related to the document in a conceptual sense.

2.1. Extraction of indexing terms

Index term extraction is relatively straightforward in English, but freely compounding languages such as German and unsegmented languages such as Chinese pose additional challenges. As is well known, the easiest way to extract indexing terms from a document, recognizing tokens separated by white-space, is often too simple. Here we describe a range of techniques that provide better results given the challenges presented by a representative range of languages including English, French, Arabic, Chinese, and German. These approaches fall into two main categories: automatically segmenting the text stream into a single sequence of non-overlapping words, which might then be subjected to further processing such as stemming, and indexing overlapping character sequences.
2.1.1. English and French: tokenization, clitic splitting, and stemming

English and French are written with generally space-delimited words, and simple pattern-based approaches to tokenization work well for separating words from punctuation. In both English and French, clitic splitting is employed to separate morphemes connected to a word by an apostrophe. Techniques typically consist of a twofold process of separating the clitic and then expanding it, e.g., m'aidez → m' aidez → me aidez.

After tokenization as above, the resulting words are often normalized via either morphological analysis (e.g., Koskenniemi, 1983), mapping inflected verbs such as aidez to root forms such as aider, or, more commonly, by the process of stemming, which typically involves application of a set of rules for removal of prefixes or suffixes or both. The widely used rule-based approach to stemming pioneered by Porter (1980) stems continua, continuer, continuait, continuera, continuant, continuerait, continuation, continueront, continue, continuez, continué, and continuité to the single stem continu.

Notice that unlike morphological normalization, which usually preserves part-of-speech distinctions, stemming freely collapses across parts of speech (e.g., the noun continuité and the verb continuez reduce to the same stem) and may freely produce terms that are not actually words in the language. This increases the likelihood of a match when normalized document and query representations are used.

In our experiments on French documents, we first applied a two-stage clitic separation and expansion approach. We then applied a rule-based Porter-style stemmer, freely available from http://xapian.org, to normalize across morphological variants.
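The two-stage clitic handling and rule-based suffix stripping described above can be sketched as follows. This is a minimal illustration: the clitic table and suffix inventory are small fragments chosen to cover the examples in this section, not the full rule sets of the Porter-style stemmer used in our experiments.

```python
import re

# Illustrative French clitic expansion table; real systems use a fuller inventory.
CLITIC_EXPANSIONS = {"m'": "me", "t'": "te", "s'": "se", "l'": "le",
                     "d'": "de", "j'": "je", "n'": "ne", "qu'": "que"}

def split_clitics(token):
    """Two-stage clitic handling: separate the clitic, then expand it,
    e.g. m'aidez -> m' aidez -> me aidez."""
    m = re.match(r"^(qu'|[mtsldjn]')(.+)$", token, re.IGNORECASE)
    if not m:
        return [token]
    clitic, rest = m.group(1).lower(), m.group(2)
    return [CLITIC_EXPANSIONS.get(clitic, clitic), rest]

# A deliberately tiny suffix inventory, ordered longest-first so the most
# specific rule fires before its substrings.
SUFFIXES = ["eraient", "erait", "eront", "ation", "ait", "ant", "era",
            "ez", "er", "e", "a"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word
```

With these rules, continuez, continuait, and continuation all reduce to continu, mirroring the Porter-style behavior described above.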
2.1.2. Arabic: complex morphology

The morphology of Arabic is far more complex than that of English or French. Adopting a generative view of Arabic morphology, template-based character insertions are used to convert generalized "roots" (such as "ktb" [in the standard transliteration], which serves as the base form for many words that have to do with writing) into more specific "stems" (such as "ktAb," which means "book"). Prefixes and suffixes can then be added to these stems to form words, and some common modifiers can be adjoined to the beginning or end of a word to form a limited class of compound forms (e.g., "wktAbAn", "and two books"). It is often the case that several roots could be used to generate the same Arabic token, so token-level analysis is often highly ambiguous. State of the art techniques such as two-level finite-state morphology (Beesley, 1998) therefore typically generate several possible, but sometimes highly improbable, analyses. Three approaches to this challenge are possible. The first is to do a full analysis and then select the most probable results based on corpus statistics and/or context (e.g., Darwish, 2002). A widely used alternative is to instead apply rule-based techniques to remove common prefixes and suffixes (whether from morphology or compounding) to produce something akin to English or French stemming. This approach is typically referred to as "light stemming" in Arabic, since the resulting "stems" sometimes differ from what would be produced by a full linguistic analysis (Aljlayl & Frieder, 2002). A third approach is corpus-based clustering, in which terms found in some other way (e.g., through light stemming) are grouped into classes based on their distributional characteristics (e.g., De Roeck & Al-Fares, 2000). We used the first two approaches for the illustrative experiments described in this article that involve Arabic (Darwish, 2002). In our experiments we formed four kinds of terms:

• Tokens, in which only white-space was stripped.
• Linguistic stems, in which affixes were stripped using the most likely analysis from the Sebawai morphological analyzer (Darwish, 2002).
• Linguistic roots, in which the most likely Sebawai analysis was used to identify the root (e.g., alkitab → ktb).
• Lightly stemmed words, in which affixes were automatically stripped using a simple rule-based system (Al-stem) (e.g., alkitab → kitab).
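To make the light stemming idea concrete, the following sketch strips one common prefix and one common suffix in Latin transliteration. The affix inventories here are illustrative only; the actual Al-stem rule set is larger and operates on Arabic script.

```python
# Illustrative affix inventories in Latin transliteration, longest-first;
# these are stand-ins for the fuller rules a system like Al-stem applies.
PREFIXES = ["wal", "al", "w", "b", "l"]
SUFFIXES = ["At", "wn", "An", "p", "h"]

def light_stem(token, min_len=3):
    """Rule-based light stemming: strip at most one common prefix and one
    common suffix, keeping a residue of at least min_len characters."""
    for pre in PREFIXES:
        if token.startswith(pre) and len(token) - len(pre) >= min_len:
            token = token[len(pre):]
            break
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) - len(suf) >= min_len:
            token = token[: -len(suf)]
            break
    return token
```

Under these rules, alkitab reduces to kitab (matching the example above), and the compound form wktAbAn reduces to the stem ktAb; note that no attempt is made to recover the root ktb, which requires full morphological analysis.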
2.1.3. Chinese: word segmentation

Spoken languages generally lack any explicit marking of the breaks between words, and some written languages exhibit similar characteristics, lacking between-word spaces or other indications of word boundaries. Two approaches to term extraction are possible in such cases: (1) automatic segmentation, and (2) overlapping character n-grams. We have chosen Chinese to illustrate these approaches.

Automatic segmentation techniques typically model the task as selecting a partition of the character sequence that corresponds to word boundary positions (although variants that include aspects of stemming or expansion of contractions have also been explored). A wide variety of techniques have been developed, but all can be cast in a framework of model-based optimization. The simplest example is longest substring matching, in which a sentence is traversed from left to right, removing the longest dictionary term that begins at the present position (or a single character, if no dictionary term is found). This corresponds to minimizing the number of characters that are not covered by a term found in the dictionary, using a greedy search strategy. The key ideas in this framework are the function to be optimized (in this case, a function of the chosen partition) and the search strategy used to explore the space of possible partitions.
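The greedy longest-match procedure just described can be sketched in a few lines. For readability the test below uses Latin-alphabet "words", but the code applies unchanged to Chinese character strings.

```python
def greedy_segment(text, dictionary):
    """Longest substring matching: traverse left to right, removing the
    longest dictionary term that begins at the current position, or a
    single character when no dictionary term matches there."""
    max_len = max(map(len, dictionary)) if dictionary else 1
    terms, i = [], 0
    while i < len(text):
        # Try the widest candidate first, falling back to one character.
        for width in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + width]
            if width == 1 or candidate in dictionary:
                terms.append(candidate)
                i += width
                break
    return terms
```

Single-character fallbacks mark spans the dictionary does not cover; minimizing the number of such fallbacks is exactly the optimization criterion described above.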
Other optimization functions incorporate the degree of fit to hand-segmented training data (for supervised techniques) (Emerson, 2001) or measures of consistency such as minimum description length (for unsupervised techniques). Greedy search strategies are widely used, but dynamic programming (Barras, Geoffrois, Wu, & Liberman, 1998) or exhaustive enumeration (Jin, 1992) are also sometimes employed.

The difficulty of accurate word segmentation for Chinese has led to extensive use of overlapping character n-grams for indexing Chinese. The idea is to eschew the notion of a definitive segmentation altogether, and instead generate all the character n-grams of a fixed width observed in the text. This abandons any semblance of interpretability for the extracted terms, but it does have the advantage of producing term representations that support good matches when they exist, as well as permitting partial matches.

To illustrate with an English word, if a query contains the word china, then the trigram terms generated from the query will provide matches against documents containing occurrences not only of china, but also of chinese and indochina, since the trigrams chi and hin are shared. This provides the effect of stemming without the necessity of identifying token boundaries.
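The trigram example above can be reproduced directly:

```python
def char_ngrams(text, n):
    """All overlapping character n-grams of width n observed in the text."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]

# The query term "china" shares the trigrams chi and hin with both
# "chinese" and "indochina", yielding the partial-match effect described.
shared = set(char_ngrams("china", 3)) & set(char_ngrams("chinese", 3))
```

Indexing then treats each n-gram as an ordinary term, so standard term-matching machinery provides the stemming-like behavior with no segmentation step.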
The same approach can also be used for languages with white-space delimited tokens, of course. In practice, the choice of n-gram width varies by language, and tends to correlate with the average size of a morphological unit; hence English is best represented using n-grams with n in the vicinity of 5 (Mayfield & McNamee, 1999), and Chinese is best represented with n of 2 (Meng et al., 2001; Wilkinson, 1997). 4

For the illustrative experiments in this article that involve Chinese, we have experimented both with heuristic longest-match segmentation using the NMSU segmenter (Jin, 1992) and with terms based on Chinese character bigrams (Section 5.4).
2.1.4. German: decompounding

Although German uses white-space to separate words, its well-known productivity with respect to compound words raises issues of within-word segmentation similar to those of Chinese. German decompounding can therefore be viewed in the same optimization and search framework. We apply a dictionary-based, greedy approach in our experiments, using the German side of our German–English bilingual term list as the segmentation dictionary. Morphological normalization for German terms, either pre- or post-decompounding, can be addressed using the same sorts of normalization approaches discussed above for English and French.
2.2. Document expansion

A quintessential problem in information retrieval is that the same underlying ideas can be represented in many different ways in the observed text. One approach to this problem is to represent underlying concepts as hidden variables in a probabilistic model (Kraaij & Hiemstra, 1998; Ponte & Croft, 1997). Another is to treat term-based representations as incomplete, and to expand them to include terms representative of the underlying concepts that cannot be extracted explicitly from the text itself. This is the basis for widely used query expansion techniques in monolingual information retrieval.

While query expansion is a well-established technique for both monolingual and cross-language information retrieval, document expansion has only recently been applied to these tasks. The document expansion approach was first proposed by Singhal and Pereira (1999) in the context of spoken document retrieval. Since spoken document retrieval involves search of error-prone automatic speech recognition transcripts, Singhal and Pereira introduced document expansion as a way of recovering words that might have been in the original broadcast but that had been misrecognized. Their results showed that correctly recognized terms yield a topically coherent transcript, while the errors tend not to co-occur in comparable documents. Using the document as a query against a comparable collection typically yields documents that contain some highly selective related terms; adding those terms to the document improved retrieval effectiveness.
The same idea can be applied in CLIR to find words that the author might have used; this can achieve an effect similar to post-translation query expansion. We expanded the original news stories with the most selective terms from related documents. First, all documents underwent basic term extraction as described above. Then each document was reformatted as a query in which all terms were weighted equally. We used the full document collection as a comparable collection to be searched for enriching terms. We then selected the top five ranked documents (excluding the original document itself) as sources of expansion terms. Next, we chose highly selective terms from these documents by ranking terms in decreasing order of inverse document frequency. We added instances of each term to the original document, up to one less than the number of expansion documents in which it appeared, until we had approximately doubled the original document length. 5 This process sought to maintain the fidelity of the term frequency component for term weighting. Finally, we indexed the resulting expanded documents.

4 Chinese characters carry semantic content in ways English characters do not, so on average words are approximately two characters long, and individual characters often carry some recognizable component of meaning.
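The expansion procedure above can be sketched as follows. The `retrieve` function stands in for the ranked-retrieval engine (a black box here), and raw inverse document frequency is used for selectivity; both are simplifying assumptions rather than the exact components used in the experiments.

```python
import math
from collections import Counter

def expand_document(doc_terms, doc_id, collection, retrieve, n_docs=5):
    """Document expansion sketch: use the document as a query, take the
    top-ranked comparable documents (excluding the document itself), rank
    candidate terms by inverse document frequency, and add instances of
    each term until the document has roughly doubled in length."""
    # Document frequencies over the comparable collection, for IDF.
    df = Counter()
    for terms in collection.values():
        df.update(set(terms))
    n = len(collection)

    # `retrieve` is an assumed black-box ranked-retrieval function that
    # returns document ids in decreasing order of score.
    ranked = [d for d in retrieve(doc_terms) if d != doc_id][:n_docs]

    # Count, per candidate term, how many expansion documents contain it.
    appears_in = Counter()
    for d in ranked:
        appears_in.update(set(collection[d]))

    # Most selective (highest-IDF) candidates first.
    candidates = sorted(appears_in,
                        key=lambda t: math.log(n / df[t]),
                        reverse=True)

    budget = len(doc_terms)          # stop once length has roughly doubled
    expanded = list(doc_terms)
    for term in candidates:
        # One less than the number of expansion documents containing it.
        copies = min(appears_in[term] - 1, budget)
        if copies > 0:
            expanded.extend([term] * copies)
            budget -= copies
        if budget <= 0:
            break
    return expanded
```

Adding multiple instances of a term (rather than a single occurrence) is what preserves a meaningful term frequency component for downstream weighting.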
3. Query processing

As is the case for document processing, the processing of queries in dictionary-based CLIR depends on extraction of query terms in order to represent the information need, perhaps with query expansion to augment that representation with additional related terms. In this section we elaborate on the details of pre-translation query processing; translation issues are then taken up in Section 4.

3.1. Pre-translation term extraction

In Section 2.1, we focused on the extraction of terms from documents in order to provide term-based representations for retrieval. The process of term extraction from queries involves essentially the same linguistic issues and utilizes many of the same techniques. There is, however, a key difference when extracting terms from queries in a query-translation architecture: the terms extracted from queries are going to form the basis for translation, not for matching.

Matching techniques (in monolingual applications) typically seek to enhance recall by conflating surface differences using stemming and segmentation processes, at the expense of some precision. In contrast, for dictionary-based CLIR, a key concern is mitigating the effects of ambiguity, where multiple terms with multiple senses result in an explosion of translation alternatives. We focus on matching the dictionary at the highest level of selectivity to minimize ambiguity, and then apply backoff strategies to enhance coverage only when exact-match translations are unavailable.

As an additional tactic to reduce ambiguity, we take advantage of the observation that multi-word expressions rarely have more than one interpretation (e.g., the word "house" in "White House" cannot be translated in the verb sense meaning "to shelter"); translating multi-word expressions as a unit is well known to be helpful for CLIR (Ballesteros & Croft, 1997). Since the task of identifying useful multi-word expressions can be viewed as a variant of the segmentation optimization problem, we applied the greedy longest-match technique described in Section 2.1.
3.2. Pre-translation expansion

Query expansion is a well-established technique in monolingual information retrieval. Very short (2–3 word) queries are common in some applications (e.g., Web search), and expansion can help to compensate for this kind of incomplete specification of the information need. Brevity can yield ambiguity (reducing precision), or may result in omission of terms that are used by authors of the documents that are sought (reducing recall). Query expansion using pseudo-relevance feedback has been shown to partially overcome these difficulties (Buckley, Salton, Allan, & Singhal, 1994). While expansion generally increases mean retrieval effectiveness, it may also concurrently increase variance across queries.
Ballesteros and Croft (1997) evaluated pre- and post-translation query expansion in a Spanish–English cross-language information retrieval task and found that combining the two improved both precision and recall, with pre-translation expansion improving both precision and recall and post-translation expansion enhancing precision. Mayfield and McNamee's ablation experiments on the effect of translation resource size on pre- and post-translation query expansion effectiveness demonstrated the dominant role of pre-translation expansion in providing translatable terms (McNamee & Mayfield, 2002). If too few terms are translated, post-translation expansion can provide little improvement. In pre-translation query expansion, our goal is twofold: as in monolingual query expansion, to provide additional terms that refine the query and enhance the probability of matching the terminology chosen by the authors of the documents; and to provide additional terms that limit the possibility of failing to translate a concept in the query simply because the particular term is not present in the translation lexicon.

We performed the expansion as follows. We constructed the initial query in the normal manner for INQUERY (Callan, Croft, & Harding, 1992). We then used INQUERY's relevance feedback process to obtain expansion terms based on the 10 highest ranked retrieved documents from the contemporaneous 1994 Los Angeles Times documents, part of the Cross-Language Evaluation Forum (CLEF) (Peters, 2001) 2000 corpus. Experiments with three, five, and ten expansion documents were conducted, with minimal difference in resulting retrieval effectiveness. We concatenated the expansion term set to the original query and used the resulting query as the basis for translation. 6

5 More efficient implementations of the document expansion procedure are possible, but this simple approach suffices to illustrate the term selection process and its contribution.
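A generic pseudo-relevance feedback expansion in this spirit can be sketched as follows. This is not INQUERY's actual relevance feedback formula; it simply scores candidate terms by their frequency in the top-ranked documents, weighted by IDF, and concatenates the best ones to the query with uniform weight, as in the simple strategy we adopt.

```python
import math
from collections import Counter

def prf_expand(query_terms, collection, retrieve, n_docs=10, n_terms=5):
    """Pseudo-relevance feedback sketch: retrieve with the original query,
    score candidate terms from the top-ranked documents by frequency
    weighted with IDF, and append the best candidates to the query."""
    # Document frequencies over the collection, for IDF weighting.
    df = Counter()
    for terms in collection.values():
        df.update(set(terms))
    n = len(collection)

    # `retrieve` is an assumed black-box ranked-retrieval function.
    top = retrieve(query_terms)[:n_docs]
    tf = Counter()
    for d in top:
        tf.update(collection[d])

    # Score candidates not already in the query; highest scores first.
    scored = sorted(
        (t for t in tf if t not in query_terms),
        key=lambda t: tf[t] * math.log(n / df[t]),
        reverse=True,
    )
    return list(query_terms) + scored[:n_terms]
```

The concatenated result then serves as the basis for translation, exactly where uniform weighting of original and expansion terms (footnote 6) comes into play.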
4. Translation knowledge and query translation

For CLIR, translation knowledge provides the crucial bridge between the user's information need expressed in one language and document concepts expressed in the document language. While approaches using off-the-shelf machine translation systems, alone or in combination with other translation resources, have been shown to be effective for CLIR tasks (Gey, Jiang, Chen, & Larson, 1998), they are limited to the relatively small number of language pairs for which such systems exist. Since our goal is to focus on broadly applicable techniques, we focus on the simplest form of a translation lexicon, the bilingual term list, which is already available for many language pairs and can be constructed relatively easily for others. A bilingual term list is an unordered set of query-language/document-language term translation pairs, often with no translation preference or part-of-speech information. In this section, we first describe the bilingual term lists that we used in the CLIR experiments reported below. We next describe a general methodology for backoff translation that enhances dictionary coverage. We then describe two main strategies for integrating translation evidence and managing ambiguity through different term weighting techniques. We conclude by presenting two methods to further enhance matching of the translated query with the document index through term extraction and expansion processes.
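Since a bilingual term list is just an unordered set of term translation pairs, loading one is straightforward. This sketch assumes the simplest common layout, one tab-separated pair per line; as discussed in Section 4.1, real lists vary widely in format, so the parsing step is an assumption for illustration.

```python
from collections import defaultdict

def load_term_list(lines):
    """Parse a bilingual term list into a mapping from query-language
    terms to their document-language translations. Assumes one
    tab-separated pair per line; blank and commented lines are skipped."""
    translations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split("\t", 1)
        if target not in translations[source]:
            translations[source].append(target)
    return translations
```

Because the list carries no translation preference, each query-language term simply maps to all of its recorded alternatives; the weighting strategies discussed below decide how that ambiguity is managed.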
4.1. Bilingual term lists and optimizing coverage

Bilingual term lists are easily found on the Web, often having been created initially for use with simple online bilingual dictionary programs. However, since these term lists were constructed for diverse purposes and may derive from diverse sources, their ready availability is a great advantage but their possibly eclectic structure also presents challenges. They may vary dramatically along several dimensions, including number of entries, source, number of multi-word entries, degree of ambiguity, and mix of surface or root form entries. A key challenge for dictionary-based CLIR systems is to develop techniques to most fully exploit these resources while minimizing any negative impact of their more problematic characteristics. A characterization of the bilingual term lists used in our experiments appears in Table 1.

Table 1
Characterization of bilingual term lists

Translation resource   # English terms   # Document-language terms   Source
English–French         20,100            35,008                      http://www.freedict.com
English–Arabic         137,235           179,152                     Web word translations
English–Chinese        199,444           395,216                     CETA + http://www.ldc.upenn.edu
English–German         99,357            131,273                     http://www.quickdic.de

In the cases of English paired with French, Chinese, 7 and German, we used existing static bilingual dictionary resources. Lacking an English–Arabic resource

6 In general, query expansion should involve term reweighting as well, both increasing and decreasing; here we adopt the simple strategy of uniform weighting of original and expansion terms.
