`
`CROSS-LANGUAGE
`INFORMATION RETRIEVAL
`
`edited by
`
`Gregory Grefenstette
`Xerox Research Centre Europe
`Grenoble, France
`
`
`KLUWER ACADEMIC PUBLISHERS
`,
`Boston / Dordrecht / London
`
`
`
`
`
`
`AOL Ex. 1023
`Page 1 of 11
`
`
`
`
`Distributors for North America:
`Kluwer Academic Publishers
`101 Philip Drive
`Assinippi Park
`Norwell, Massachusetts 02061 USA
`
`Distributors for all other countries:
`Kluwer Academic Publishers Group
`Distribution Centre
`Post Office Box 322
`3300 AH Dordrecht, THE NETHERLANDS
`
`|
`
`|
`'
`
`
`Library of Congress Cataloging-in-Publication Data
`
A C.I.P. Catalogue record for this book is available
from the Library of Congress.
`
`
`
The publisher offers discounts on this book when ordered in bulk quantities. For
more information contact: Sales Department, Kluwer Academic Publishers,
101 Philip Drive, Assinippi Park, Norwell, MA 02061
`
`Copyright © 1998 by Kluwer Academic Publishers
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical, photo-
copying, recording, or otherwise, without the prior written permission of the
publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell,
Massachusetts 02061
`
`Printed on acid-free paper.
`
`Printed in the United States of America
`
`
`
`
THE PROBLEM OF
CROSS-LANGUAGE
INFORMATION RETRIEVAL
`
`Gregory Grefenstette
`
Xerox Research Centre Europe
`6 chemin de Maupertuis
`38240 Meylan, France
Gregory.Grefenstette@xrce.xerox.com
`
`
`
`
`ABSTRACT
`
As the World Wide Web infiltrates more and more countries, banalizing network, interface, and computer system differences which have impeded information access, it becomes more common for non-native speakers to explore multilingual text collections. Beyond merely accepting 8-bit accented characters, information retrieval systems should provide help in searching for information across language boundaries. This situation has given rise to a new research area called Cross Language Information Retrieval, at the intersection of Machine Translation and Information Retrieval. Though sharing some problems in both of these areas, Cross Language Information Retrieval poses specific problems, three of which are described in this chapter.
`
`1
`
`INTRODUCTION
`
Though the elimination of language barriers through the universal adoption of one language is an age-old dream, the fact is and will remain that electronically accessible information exists in many different languages. Currently, as the World Wide Web becomes better established as a communication means within more and more countries, information is being produced in an ever-increasing variety of languages on the Internet. A similar situation is developing even within the intranets of multinational corporations. This situation exacerbates the need to find ways of retrieving information across language boundaries, and to understand this information, once retrieved.
`
`
`
`
`
`
`1
`
`j
`1
`
`|
`
`
`
`i
`;
`
`{
`
`|
`i
`i}
`fi
`|
`i
`
`|
`|
`|
`|
`
`iim |
`1
`|
`a |
`|
`
`'
`
`'
`
`|Sue
`
`2
`
`CHAPTER 1
`
Computer approaches to understanding foreign language texts range from rapid glossing systems [BSZ95] to full-fledged machine translation systems. But before these comprehension-aiding approaches are used, some selection must be made among all the documents to which they can be applied. Cross Language Information Retrieval research addresses this initial task of filtering, selecting and ranking documents that might be relevant to a query expressed in a different language.

Cross Language Information Retrieval, though related to it, is easier than Machine Translation. They have in common that systems developed with either approach in mind must produce versions of the same text in different languages, but machine translation systems must respect two additional constraints: choosing one and only one way of expressing a concept, and producing a syntactically correct version of the target language text that reads like naturally created text. A Cross Language Information Retrieval system has an easier job, needing only to produce the translated terms to be fed to an information retrieval system, with little worry about presentation of its intermediate results for human consumption.
Cross Language Information Retrieval (CLIR), as a young cousin of Information Retrieval (IR), shares many of the characteristics of the general IR problem. The classical Information Retrieval paradigm [SM83a] is the following: a user wants to see documents (e.g. abstracts, paragraphs, articles, Web pages) about a certain topic; the user provides a free form textual description of the topic as a query; from this query the information retrieval engine derives index terms; those index terms are matched against index terms derived from previously treated documents; the documents which match best are returned to the user in a ranked order. Traditional IR measures of success are precision (how many documents in this ranked list are really relevant to the initial query) and recall (how many of the relevant documents that could possibly be found in the document collection are really in the list).
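These two measures can be computed directly from a ranked result list and a set of relevance judgments; a minimal sketch, in which the document identifiers and judgments are invented purely for illustration:

```python
def precision_recall(ranked, relevant):
    """Compute precision and recall for a retrieved, ranked list.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved = set(ranked)
    hits = retrieved & set(relevant)
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

# Hypothetical run: the engine returns 4 documents; 5 are truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5", "d6", "d7"])
# p = 2/4 = 0.5, r = 2/5 = 0.4
```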
Many of the experiments described in this book use these measures of precision and recall over commonly available testbeds of documents and queries as evaluation measures. But since CLIR is also related to Machine Translation (MT), it has some specific problems not shared by traditional, monolingual text retrieval. Traditional IR can work with the words used in the initial user query, and most of the effectiveness of IR systems comes from the overlap between these query words, including slight morphological alterations, and the same words appearing in the relevant documents. The basic intuition is that the more a document contains the words appearing in a user query, the more likely the document is relevant to that query. In CLIR, of course, the initial
`
`
`
`
query is in one language and the documents are in another. Outside of cognates and some proper names which might be written the same in both languages, simple string matching mechanisms will rarely work.
`
2 THE THREE PROBLEMS OF CLIR
`
The first problem that a CLIR system must solve, then, is knowing how a term expressed in one language might be written in another. The second problem is deciding which of the possible translations should be retained. The third problem is deciding how to properly weight the importance of translation alternatives when more than one is retained.
`
The first two problems, how to translate and how to prune alternatives, are also endemic to Machine Translation systems. The CLIR system, however, has the luxury of eliminating some translations while retaining others. Retaining ambiguity can be useful in promoting recall in an information retrieval system. Consider the following example: the French word traitement can be translated by English salary or treatment. A Machine Translation system must commit to a translation at one point. If the original French query is about waste treatment and a CLIR system retains both treatment and salary, then some noise may be introduced, but documents about waste treatment will be found that would remain unranked by a translation system that chooses the unique but erroneous translation salary.
`
The third CLIR problem, related to how to treat retained alternatives, is something that distinguishes CLIR from both Machine Translation and from monolingual IR. Suppose that the initial query contains two independent search terms. If the first term can be translated in many different plausible ways, and if the second term can be translated in only one way, the retrieval system should not give more weight to the first word merely because it has more translation options. This illustrates the translation weighting problem, specific to CLIR systems. A document that contains one translation of each query term would probably be more relevant than a document that contains many variants of the first term's translations but none of the second.
`
Each system presented in this book attacks these three problems in slightly different ways. In the next sections, we will review how each of the papers in this collection deals with these three problems of providing translations, pruning translations and controlling the weighting of translation alternatives.
`
`
`
`
`3 FINDING TRANSLATIONS
`
`3.1 Using Dictionaries
`
The easiest way to find translations is to have them provided in a bilingual dictionary. Machine readable bilingual dictionaries exist for many languages. Both Ballesteros [BC98] and Davis [Dav98] use an electronic version of the Collins English-Spanish Dictionary. Hull [Hul98] has been using a version of the Oxford University Press English-Spanish dictionary. But, despite their name, which indicates that they are readily exploitable by computers, machine readable dictionaries pose many problems. Their content is geared toward human exploitation, so much of the information about translations is implicitly included in dictionary entries. Making this information explicit for use by a computer is no small task.
`
`
Finding translations useful for Cross Language Information Retrieval in a machine readable dictionary raises a number of problems. Samples of problems are (a) missing word forms: for example, an entry for electrostatic may be included in the dictionary, but the word electrostatically may be missing since a human reader can readily reconstruct one form from the other. Stemming headwords can mitigate this problem at the expense of increased noise, such as seeing marine producing translations related to marinated; (b) spelling norms: usually only one national form appears as a headword. For example, in a bilingual dictionary concerning English, the dictionary would have a heading for only one of the spellings, colour or color; (c) spelling conventions: the use of hyphenation varies, one can see fallout, fall out and fall-out in text, but all variants may not appear in the dictionary; (d) coverage: general language dictionaries contain the most common words in a language, but rarer technical words are often missing. For example, the 1-million word Brown corpus [FK82] contains the word radiopasteurization nine times, but this word would rarely appear in translation dictionaries; (e) proper names: country names and personal names often need to be translated. For example, the Russian president's name is written Yeltsin in English and Eltsine in French.
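Problem (a), and the noise that stemming introduces, can be made concrete with a small sketch. The toy suffix stripper and the dictionary entries below are invented for illustration; a real system would use a proper stemmer and a full bilingual lexicon:

```python
# Recover a missing word form by stemming both the query word and the
# dictionary headwords. SUFFIXES and the entries are hypothetical.
SUFFIXES = ["ally", "ing", "ed", "ic", "at", "e", "s"]

def crude_stem(word):
    """Strip known suffixes repeatedly until none applies."""
    changed = True
    while changed:
        changed = False
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf) + 3:
                word = word[: -len(suf)]
                changed = True
                break
    return word

# Hypothetical English-French entries: "electrostatically" is absent.
bilingual = {"electrostatic": ["électrostatique"], "marine": ["marin", "maritime"]}
stemmed_index = {crude_stem(head): trans for head, trans in bilingual.items()}

def lookup(word):
    return bilingual.get(word) or stemmed_index.get(crude_stem(word), [])

lookup("electrostatically")  # recovered via the stemmed headword
lookup("marinated")          # spurious hit on "marine": the noise the text warns about
```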
`
`
Even when the headword is present, finding the translation within the dictionary entry can be difficult. The translation may be buried in a sample use. For example, the translation of a French word like entamer might be contained in a phrase enter into a discussion with someone, in which the extra words discussion and someone appear. Someone may be considered part of the metalanguage of the dictionary and thus eliminated, but the word discussion is part of a sample use and must be identified as extraneous to the translations of
`
`
the headword. The specific word that translates the headword may not be identifiable by any automatic means. Added to this problem of finding which words correspond to translations and which are extra information are common structural inconsistencies in the SGML markup of machine-readable dictionaries, which may or may not appear in the printed version of the dictionary, but which often cause an automatic processing of definitions to break down or to produce erroneous entries.
`
In addition to Ballesteros, Davis, and Hull, the systems described in the chapters by Yamabana [YMDK98], Fluhr [FSO+98], and Gachot [GLY98] also use bilingual dictionaries in order to find translation alternatives for Cross Language Information Retrieval.
`
Picchi [PP98] describes what appears to be a round-about method for finding translation alternatives, since they have a bilingual dictionary but they do not use it for directly looking up translations. Instead, they take each term in the source language query and build up a context vector for it from a source language corpus covering a specific domain. The context vector of a given word is a weighted list of words that significantly often appear around the given word, using the mutual information statistic [CH90] as a measure of significance. The words in the context vector are then translated using the bilingual dictionary in order to create a target language context vector. Then, using a (not necessarily parallel) target language corpus from a similar domain, they look for words with similar context vectors in the target language. These words are then used as domain-specific translation alternatives. This method has the potential to translate terms which do not appear in the bilingual dictionary.
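The two building blocks of this method can be sketched as follows. This is only an illustration of the idea, not Picchi's actual formulas: the window size, the pointwise form of the mutual information statistic, and the word-by-word vector translation are simplifying assumptions.

```python
import math
from collections import Counter

def context_vectors(sentences, window=3):
    """Build PMI-weighted context vectors from a tokenized corpus.

    pmi(w, c) = log2( P(w, c) / (P(w) P(c)) ), estimated from
    co-occurrence counts within a fixed window; a [CH90]-style system
    would also threshold on frequency.
    """
    word_count, pair_count, total = Counter(), Counter(), 0
    for sent in sentences:
        for i, w in enumerate(sent):
            word_count[w] += 1
            total += 1
            for c in sent[max(0, i - window): i]:   # preceding words only
                pair_count[(w, c)] += 1
                pair_count[(c, w)] += 1
    vectors = {}
    for (w, c), n in pair_count.items():
        pmi = math.log2(n * total / (word_count[w] * word_count[c]))
        if pmi > 0:                                  # keep significant contexts
            vectors.setdefault(w, {})[c] = pmi
    return vectors

def translate_vector(vec, bilingual):
    """Map a source context vector into the target language word by word,
    keeping the PMI weights (the dictionary here is hypothetical)."""
    out = {}
    for word, weight in vec.items():
        for t in bilingual.get(word, []):
            out[t] = max(out.get(t, 0.0), weight)
    return out
```

The translated vector would then be compared against context vectors built from the target language corpus to find words used in similar contexts.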
`
`3.2 Using Parallel Corpora
`
Another option to using translation dictionaries is using a parallel corpus, that is, the same text written in different languages. If the corpus is large enough, then simple statistical techniques [BCP+90, HdK97] can be used to produce bilingual term equivalents by comparing which strings co-occur in the same sentences over the whole corpus. But, as mentioned before, the Cross Language Information Retrieval problem is slightly different from the Machine Translation problem: one does not need to find the exact translation of a given word in a given context, one is looking for documents about a given subject in a different language.
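A much-simplified version of such co-occurrence statistics can be sketched over a sentence-aligned corpus; the Dice coefficient used here is one common association score, chosen for brevity, and is not the specific technique of [BCP+90] or [HdK97]:

```python
from collections import Counter

def dice_equivalents(aligned_pairs, min_score=0.5):
    """Rank candidate translation pairs over a sentence-aligned parallel
    corpus by Dice score: 2*|sentences where s and t co-occur| / (|s| + |t|).
    A simplification of the statistical alignment techniques cited above."""
    src_count, tgt_count, co_count = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_pairs:
        src_words, tgt_words = set(src_sent), set(tgt_sent)
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        for s in src_words:
            for t in tgt_words:
                co_count[(s, t)] += 1
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in co_count.items()
        if 2 * n / (src_count[s] + tgt_count[t]) >= min_score
    }
```

On a tiny invented French-English corpus, a pair such as (traitement, treatment) that always co-occurs receives the maximal score of 1.0.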
`
`
`
`
Work by Sheridan [SB96] illustrates this difference. Using a bilingual corpus of newspaper articles, aligned on dates and language-independent subject codes, a search was first done on German documents using a German query. The highest ranking German documents responding to the query were extracted along with the Italian documents from the same dates with the same subject codes. From these Italian documents, the most frequently appearing words were extracted to form an Italian query. In this way, German words are not translated directly, but rather a pool of German words is made to correspond to a pool of Italian words, with the exact relations between words in each language left unspecified. This system, called SPIDER, now precalculates a similarity thesaurus over all time-aligned documents. When a new query comes in, the most similar terms in the target language are extracted directly from this thesaurus. Formulas for creating this similarity thesaurus are given in [SB96].
`
The idea behind this parallel corpora approach is that a group of words forms some kind of point in an imaginary semantic space, and that articles in a different language about the same subjects will use words from the same semantic space defined in a different language. This idea of finding different words in a nearby semantic space also underlies Latent Semantic Indexing methods.

In the Latent Semantic Indexing approach described by Littman [LDL98], and used also by Oard [OD98], each word from both languages of a parallel corpus becomes the row of a matrix with the columns corresponding to document numbers. Each entry (m, n) in the matrix corresponds to the number of times the word m appears in the document number n. Documents which are translations of each other have the same number, so that if a word a is always translated by a word b in the corpus, the two rows corresponding to a and b will be exactly the same. This matrix is usually very large and sparse, having one line per word in the collections and one column per document. Latent Semantic Indexing uses a matrix decomposition technique called singular-value decomposition [BDO+93] which reduces this very large matrix into three matrices, one of which has non-zero values only on the diagonal. When this diagonal is sorted from largest to smallest value, the small values and their corresponding rows and columns in the other two arrays can be thrown away while still producing a matrix very similar to the original matrix. This technique has been used to reduce transmission rates of pictures from satellites. When applied to text as described above, it reduces the number of dimensions in the semantic space, pushing similarly used words closer to each other, even if the words are in different languages. Each word is represented by a short vector of real numbers giving its position in this reduced space. One can calculate distances between each vector to find the most similar words to any one word. In this way, finding translations for a
`|
`4
`
`
`
`i
`:
`4
`|
`4
`i
`
`1
`
`iaEs
`
`
` —- eeSS ee
`
`AOL Ex. 1023
`Page 8 of 11
`
`AOL Ex. 1023
`Page 8 of 11
`
`
`
`
`
`Problem of CLIR
`
`7
`
`given word reduces to finding target language words closest to a given source
`language word using these distances.
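The reduction just described can be sketched on a toy bilingual matrix. The vocabulary and counts below are invented, and a real collection would have thousands of rows and columns, but identical rows (words that always translate each other) visibly map to the same point in the reduced space:

```python
import numpy as np

# Rows = words from BOTH languages, columns = aligned document pairs,
# entry (m, n) = count of word m in document pair n (hypothetical data).
words = ["cat", "dog", "chat", "chien"]        # English + French vocabulary
X = np.array([[2, 0, 1],                       # "cat"
              [0, 2, 0],                       # "dog"
              [2, 0, 1],                       # "chat"  (French for "cat")
              [0, 2, 0]])                      # "chien" (French for "dog")

# Singular-value decomposition; keep only the k largest singular values.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * S[:k]                   # each word as a point in k-dim space

def closest(word):
    """Nearest other word by cosine similarity in the reduced space."""
    i = words.index(word)
    v = word_vecs[i]
    sims = word_vecs @ v / (np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(v) + 1e-12)
    sims[i] = -1.0                             # exclude the word itself
    return words[int(np.argmax(sims))]

closest("cat")    # "chat": identical rows land on the same point
```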
Evans [EHM+98] also used such a semantic reduction technique, but instead of reducing a word-by-document matrix, they reduced a symptom-by-disease matrix.
`
4 PRUNING TRANSLATION ALTERNATIVES
`
Once target language translation equivalents are found for the source language words in the user query, the equivalents can be concatenated to form a new target language query to submit to the underlying information retrieval engine. One can consider this simple technique as letting the indexed text collection do the filtering, since any words not appearing in a document will not enter into the results. Hull [Hul98] and Fluhr [FSO+98] use this simple corpus filtering technique to eliminate some translation alternatives. In Fluhr's EMIR system, a database of known compounds from the target language corpus is also used to filter out translations from compound words translated word-to-word.
Slightly more proactively restrictive techniques are used by Ballesteros [BC98], who only uses the first translation alternative given by their on-line dictionary, since in some dictionaries the first translation is the most common; by Sheridan [SBS98], who only retains words in their Italian word pool, mentioned in the last section, that also appear in a predefined general Italian word list; and by Gachot [GLY98], who uses domain dependent machine translation with a restricted dictionary to reduce translation alternatives; in their system the user may specify semantic tags which eliminate ambiguities.

Yamabana [YMDK98] uses the target language corpus as a filter, but first constructs all possible target language candidate noun phrases¹ by word-to-word translation of the source language query terms. The possible noun phrases are filtered by using the highest co-occurrence statistics among the candidate terms. The result of this variant of translation-by-example [SN90] is presented to the user. Interactive translation choice is possible by altering this candidate or accessing an online dictionary.
`
¹ Ballesteros and Fluhr also experiment using attested noun phrases in the target language in order to filter translations.
`
`
`
`
Davis [Dav98] uses a more complex method for choosing among translation alternatives. First the original English language query is run over the English side of a parallel corpus. Then, in order to choose the best Spanish translation of the English query terms, each Spanish alternative is run as a query over the Spanish side of the corpus, and the alternative that produces the ranking most like the English one is chosen as the translation equivalent. English terms having no Spanish translation are run as fuzzy matches over the Spanish data, so that cognates or near cognates can be recognized.
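The selection principle can be sketched as follows. The similarity measure below (overlap of the top-k aligned document identifiers) is a simplifying assumption, not Davis's actual ranking comparison, and the search function is supplied by the caller:

```python
def choose_translation(english_ranking, candidates, search_spanish, k=10):
    """Pick the candidate whose Spanish-side ranking most resembles the
    English-side ranking over a parallel corpus.

    english_ranking : ranked document ids for the English query
    candidates      : possible Spanish translations of one English term
    search_spanish  : maps a Spanish term to a ranked list of document ids
                      (aligned corpus: translated documents share an id)
    """
    top_en = set(english_ranking[:k])

    def overlap(cand):
        return len(top_en & set(search_spanish(cand)[:k]))

    return max(candidates, key=overlap)
```

With invented rankings, the alternative that retrieves the same aligned documents as the English query wins, and an unrelated sense is discarded.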
The problem of pruning translations can be seen as one of removing ambiguity, if one considers that different translations entail different nuances of meaning. The research based upon LSI does not have this problem of pruning translation alternatives, since the method maps words into a single point in semantic space. The technique used by Littman, Oard and Evans supposes that a word only has one global sense in the parallel corpus and that the sense's translations in different languages map into the same point in the semantic space.
`
5 WEIGHTING TRANSLATION ALTERNATIVES
As mentioned previously, Cross Language Information Retrieval need not resolve all the ambiguity among possible translations of a word. If the system allows more than one possible translation, some alteration of the underlying information retrieval system should be made to compensate for this situation. In classical information retrieval, the number of times that a word appears in the query influences the importance of that word when documents are ranked against the query. In Cross Language Information Retrieval, if a word retains many translations, the weight of that word would be artificially inflated if the query is simply sent to a classic information retrieval engine.
Let's take an example. Suppose that the source language query is the Waldheim affair. When translated into French, Waldheim remains Waldheim but affair might be translated as aventure, business, affaire, case and liaison. If the translated query is simply fed into the information retrieval engine as Waldheim, aventure, business, affaire, case, liaison, then documents mentioning some of the last five words might be ranked higher than those mentioning Waldheim.
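The inflation is easy to see in term-frequency terms; in the sketch below the two "documents" are invented token lists, and the scoring is a bare match count with no per-concept normalization:

```python
from collections import Counter

# Naive translated query from the example above: "affair" expands to five
# alternatives while "Waldheim" stays unique, so a classic tf-based engine
# gives five times more mass to the "affair" concept.
translated_query = ["Waldheim", "aventure", "business", "affaire", "case", "liaison"]

def naive_score(doc_tokens, query_tokens):
    """Classic bag-of-words match count."""
    tf = Counter(doc_tokens)
    return sum(tf[q] for q in query_tokens)

doc_about_waldheim = ["Waldheim", "scandale", "affaire"]
doc_about_business = ["business", "liaison", "case", "aventure"]

naive_score(doc_about_waldheim, translated_query)  # 2
naive_score(doc_about_business, translated_query)  # 4: outranks the relevant document
```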
`
`
Many of the systems presented here do not treat this problem, hoping that the different pruning strategies will mitigate the effect by limiting the number of translations that get matched against the documents. Systems that perform strict disambiguation do not need to worry about weighting, since one word in the source language is translated by one word in the target language. However, with corpus-based techniques like those creating similarity thesauri or using LSI, there is no guarantee that all query concepts will be represented after translation.

Hull [Hul98] directly addresses the problem and proposes a weighted boolean scheme in which all the translation alternatives stemming from a given source language query term are loosely OR-ed, so that their number does not influence the ranked results any more than the original query term would influence the ranking of source language documents.
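One crude way to realize this idea is to group the alternatives per source term and let each group contribute at most what a single original term would, taking the best-matching alternative as a loose OR. This is only a sketch of the grouping principle; Hull's actual weighted boolean formulas differ:

```python
from collections import Counter

def grouped_score(doc_tokens, translation_groups):
    """Score a document so each SOURCE term counts once, however many
    translation alternatives it has: take the best-matching alternative
    per group instead of summing over all of them."""
    tf = Counter(doc_tokens)
    return sum(max(tf[alt] for alt in group) for group in translation_groups)

# The Waldheim example: one group per source-language concept.
groups = [["Waldheim"], ["aventure", "business", "affaire", "case", "liaison"]]

grouped_score(["Waldheim", "scandale", "affaire"], groups)          # 1 + 1 = 2
grouped_score(["business", "liaison", "case", "aventure"], groups)  # 0 + 1 = 1
```

Under this scoring the document that matches both concepts now outranks the one matching many variants of a single concept.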
`
6 CONCLUSION

The Cross Language Information Retrieval field brings together two distinct lines of research, information retrieval and machine translation, sharing aspects of both. But, just as at any other juncture of human effort, new, specific problems arise. We have named three of the major problems of Cross Language Information Retrieval: finding translations, pruning translations and weighting translation alternatives; and described how the systems in this book deal with them. There are, of course, many other problems dealing with multilingual collections, such as properly treating multiple character sets, maintaining variant collating orders, normalizing accentuation, separating languages within a collection via language recognition [Gre95], language-specific stemming routines or morphological analysis, visualizing text, glossing results, etc. Many of these linguistic engineering and computational linguistic problems remain to be solved.
`
Acknowledgements

I would like to thank David Hull of the Xerox Research Centre Europe for comments which influenced the final revision of this chapter.
`
`
`