CROSS-LANGUAGE
INFORMATION RETRIEVAL

edited by

Gregory Grefenstette
Xerox Research Centre Europe
Grenoble, France

KLUWER ACADEMIC PUBLISHERS
Boston / Dordrecht / London

AOL Ex. 1023
Page 1 of 11

Distributors for North America:
Kluwer Academic Publishers
101 Philip Drive
Assinippi Park
Norwell, Massachusetts 02061 USA

Distributors for all other countries:
Kluwer Academic Publishers Group
Distribution Centre
Post Office Box 322
3300 AH Dordrecht, THE NETHERLANDS

Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available
from the Library of Congress.

The publisher offers discounts on this book when ordered in bulk quantities. For
more information contact: Sales Department, Kluwer Academic Publishers,
101 Philip Drive, Assinippi Park, Norwell, MA 02061

Copyright © 1998 by Kluwer Academic Publishers
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical,
photocopying, recording, or otherwise, without the prior written permission of the
publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell,
Massachusetts 02061

Printed on acid-free paper.

Printed in the United States of America

THE PROBLEM OF
CROSS-LANGUAGE
INFORMATION RETRIEVAL

Gregory Grefenstette

Xerox Research Centre Europe
6 chemin de Maupertuis
38240 Meylan, France
Gregory.Grefenstette@xrce.xerox.com

ABSTRACT

As the World Wide Web infiltrates more and more countries, banalizing network,
interface, and computer system differences which have impeded information access,
it becomes more common for non-native speakers to explore multilingual text
collections. Beyond merely accepting 8-bit accented characters, information
retrieval systems should provide help in searching for information across language
boundaries. This situation has given rise to a new research area called Cross
Language Information Retrieval, at the intersection of Machine Translation and
Information Retrieval. Though sharing some problems with both of these areas,
Cross Language Information Retrieval poses specific problems, three of which are
described in this chapter.

1 INTRODUCTION

Though the elimination of language barriers through the universal adoption of
one language is an age-old dream, the fact is and will remain that electronically
accessible information exists in many different languages. Currently, as the
World Wide Web becomes better established as a communication means within
more and more countries, information is being produced in an ever-increasing
variety of languages on the Internet. A similar situation is developing even
within the intranets of multinational corporations. This situation exacerbates
the need to find ways of retrieving information across language boundaries, and
to understand this information, once retrieved.

Computer approaches to understanding foreign language texts range from rapid
glossing systems[BSZ95] to full-fledged machine translation systems. But before
these comprehension-aiding approaches are used, some selection must be made
among all the documents to which they can be applied. Cross Language
Information Retrieval research addresses this initial task of filtering, selecting
and ranking documents that might be relevant to a query expressed in a different
language.

Cross Language Information Retrieval, though related to it, is easier than
Machine Translation. They have in common that systems developed with either
approach in mind must produce versions of the same text in different languages,
but machine translation systems must respect two additional constraints:
choosing one and only one way of expressing a concept, and producing a
syntactically correct version of the target language text that reads like
naturally created text. A Cross Language Information Retrieval system has an
easier job, needing only to produce the translated terms to be fed to an
information retrieval system, with little worry about the presentation of its
intermediate results for human consumption.

Cross Language Information Retrieval (CLIR), as a young cousin of Information
Retrieval (IR), shares many of the characteristics of the general IR problem.
The classical Information Retrieval paradigm[SM83a] is the following: a user
wants to see documents (e.g. abstracts, paragraphs, articles, Web pages) about a
certain topic; the user provides a free form textual description of the topic as
a query; from this query the information retrieval engine derives index terms;
those index terms are matched against index terms derived from previously
treated documents; the documents which match best are returned to the user in a
ranked order. Traditional IR measures of success are precision (how many
documents in this ranked list are really relevant to the initial query) and
recall (how many of the relevant documents that could possibly be found in the
document collection are really in the list).

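The two measures can be stated as a short computation. This is a minimal sketch over sets of document identifiers; the function name and the identifiers are invented for illustration, and real evaluations usually report these figures at several cut-off points in the ranking.

```python
def precision_recall(ranked_ids, relevant_ids):
    """Return (precision, recall) for a returned list against the relevant set."""
    returned = set(ranked_ids)
    relevant = set(relevant_ids)
    hits = len(returned & relevant)
    # precision: share of returned documents that are relevant
    precision = hits / len(returned) if returned else 0.0
    # recall: share of relevant documents that were returned
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers, for illustration only.
ranked = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d3", "d5"}
p, r = precision_recall(ranked, relevant)
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found.
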
Many of the experiments described in this book use these measures of precision
and recall over commonly available testbeds of documents and queries as
evaluation measures. But since CLIR is also related to Machine Translation (MT),
it has some specific problems not shared by traditional, monolingual text
retrieval. Traditional IR can work with the words used in the initial user
query, and most of the effectiveness of IR systems comes from the overlap
between these query words, including slight morphological alterations, and the
same words appearing in the relevant documents. The basic intuition is that
the more a document contains the words appearing in a user query, the more
likely the document is relevant to that query. In CLIR, of course, the initial
query is in one language and the documents are in another. Outside of cognates
and some proper names which might be written the same in both languages,
simple string matching mechanisms will rarely work.

2 THE THREE PROBLEMS OF CLIR

The first problem that a CLIR system must solve, then, is knowing how a term
expressed in one language might be written in another. The second problem
is deciding which of the possible translations should be retained. The third
problem is deciding how to properly weight the importance of translation
alternatives when more than one is retained.

The first two problems, how to translate and how to prune alternatives, are
also endemic to Machine Translation systems. The CLIR system, however, has
the luxury of eliminating some translations while retaining others. Retaining
ambiguity can be useful in promoting recall in an information retrieval system.
Consider the following example: the French word traitement can be translated
by English salary or treatment. A Machine Translation system must commit to
a translation at one point. If the original French query is about waste treatment
and a CLIR system retains both treatment and salary, then some noise may
be introduced, but documents about waste treatment will be found that would
remain unranked by a translation system that chooses the unique but erroneous
translation salary.

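The contrast can be sketched in a few lines. The lexicon below is a toy assumption built from the traitement example, and the "commit to the first listed sense" rule merely stands in for whatever choice a translation system would make; neither is drawn from a system described in this book.

```python
# Toy bilingual lexicon; entries are illustrative assumptions, not a real dictionary.
FR_EN = {
    "traitement": ["salary", "treatment"],
    "déchets": ["waste"],
}


def clir_translate(terms, lexicon):
    """Keep every translation alternative; unknown terms pass through unchanged."""
    return {t: lexicon.get(t, [t]) for t in terms}


def commit_to_one(terms, lexicon):
    """Commit to a single translation per term (here, arbitrarily, the first listed)."""
    return [lexicon.get(t, [t])[0] for t in terms]


query = ["traitement", "déchets"]
clir_query = clir_translate(query, FR_EN)   # both senses of traitement survive
single = commit_to_one(query, FR_EN)        # only "salary" survives
```

With the ambiguity retained, a document containing waste and treatment can still match; with the single erroneous choice it cannot match on treatment at all.
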
The third CLIR problem, related to how to treat retained alternatives, is
something that distinguishes CLIR from both Machine Translation and from
monolingual IR. Suppose that the initial query contains two independent search
terms. If the first term can be translated in many different plausible ways, and
if the second term can be translated in only one way, the retrieval system should
not give more weight to the first word merely because it has more translation
options. This illustrates the translation weighting problem, specific to CLIR
systems. A document that contains one translation of each query term would
probably be more relevant than a document that contains many variants of the
first term's translations but none of the second.

Each system presented in this book attacks these three problems in slightly
different ways. In the next sections, we will review how each of the papers in
this collection deals with these three problems of providing translations, pruning
translations and controlling the weighting of translation alternatives.

3 FINDING TRANSLATIONS

3.1 Using Dictionaries

The easiest way to find translations is to have them provided in a bilingual
dictionary. Machine readable bilingual dictionaries exist for many languages. Both
Ballesteros[BC98] and Davis[Dav98] use an electronic version of the Collins
English-Spanish Dictionary. Hull[Hul98] has been using a version of the Oxford
University Press English-Spanish dictionary. But, despite their name, which
indicates that they are readily exploitable by computers, machine readable
dictionaries pose many problems. Their content is geared toward human
exploitation, so much of the information about translations is implicitly included
in dictionary entries. Making this information explicit for use by a computer is
no small task.

Finding translations useful for Cross Language Information Retrieval in a
machine readable dictionary raises a number of problems. Samples of problems
are:

(a) missing word forms: for example, an entry for electrostatic may be
    included in the dictionary, but the word electrostatically may be missing
    since a human reader can readily reconstruct one form from the other.
    Stemming headwords can mitigate this problem at the expense of increased
    noise, such as seeing marine producing translations related to marinated;
(b) spelling norms: usually only one national form appears as a headword. For
    example, in a bilingual dictionary concerning English, the dictionary
    would have a heading for only one of the spellings, colour or color;
(c) spelling conventions: the use of hyphenation varies, one can see fallout,
    fall out and fall-out in text, but all variants may not appear in the
    dictionary;
(d) coverage: general language dictionaries contain the most common words in
    a language, but rarer technical words are often missing. For example, the
    1-million-word Brown corpus[FK82] contains the word radiopasteurization
    nine times, but this word would rarely appear in translation dictionaries;
(e) proper names: country names and personal names often need to be
    translated. For example, the Russian president's name is written Yeltsin
    in English and Eltsine in French.

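Problems (a) to (c) are commonly handled by trying progressively looser forms of a word against the headword list. The sketch below assumes a toy headword table, a hand-made spelling-norm table and a deliberately crude suffix rule; all names and entries are invented for illustration, and none of them reproduces a dictionary used in this book.

```python
import re

# Toy headword list with French glosses; entries are illustrative assumptions.
HEADWORDS = {
    "fallout": ["retombées"],
    "colour": ["couleur"],
    "electrostatic": ["électrostatique"],
}

# Assumed spelling-norm table mapping one national form to the headword form.
SPELLING = {"color": "colour"}


def candidates(word):
    """Yield progressively looser forms of a word to try against the headwords."""
    w = word.lower()
    yield w
    yield re.sub(r"[-\s]+", "", w)   # fall-out, fall out -> fallout
    yield SPELLING.get(w, w)         # color -> colour
    if w.endswith("ally"):
        yield w[:-4]                 # electrostatically -> electrostatic (crude rule)


def lookup(word):
    """Return the translations of the first candidate form found, else an empty list."""
    for form in candidates(word):
        if form in HEADWORDS:
            return HEADWORDS[form]
    return []
```

A word outside the dictionary's coverage, such as radiopasteurization, still returns nothing, which is problem (d); looser rules would find more headwords at the cost of the noise mentioned under (a).
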
Even when the headword is present, finding the translation within the
dictionary entry can be difficult. The translation may be buried in a sample use.
For example, the translation of a French word like entamer might be contained
in a phrase enter into a discussion with someone, in which the extra words
discussion and someone appear. Someone may be considered part of the
metalanguage of the dictionary and thus eliminated, but the word discussion is
part of a sample use and must be identified as extraneous to the translations of
the headword. The specific word that translates the headword may not be
identifiable by any automatic means. Added to this problem of finding which
words correspond to translations and which are extra information are common
structural inconsistencies in the SGML markup of machine-readable
dictionaries, which may or may not appear in the printed version of the
dictionary, but which often cause an automatic processing of definitions to break
down or to produce erroneous entries.

In addition to Ballesteros, Davis, and Hull, the systems described in the
chapters by Yamabana[YMDK98], Fluhr[FSO+98], and Gachot[GLY98] also use
bilingual dictionaries in order to find translation alternatives for Cross
Language Information Retrieval.

Picchi[PP98] describes what appears to be a round-about method for finding
translation alternatives, since they have a bilingual dictionary but they do not
use it for directly looking up translations. Instead, they take each term in
the source language query and build up a context vector for it from a source
language corpus covering a specific domain. The context vector of a given word
is a weighted list of words that significantly often appear around the given word,
using the mutual information statistic[CH90] as a measure of significance. The
words in the context vector are then translated using the bilingual dictionary in
order to create a target language context vector. Then, using a (not necessarily
parallel) target language corpus from a similar domain, they look for words
with similar context vectors in the target language. These words are then used
as domain-specific translation alternatives. This method has the potential to
translate terms which do not appear in the bilingual dictionary.

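The first two steps of this method, building a context vector and carrying it into the target language, might look as follows. This is a sketch under stated assumptions: the window size, the raw-count estimate of pointwise mutual information and the toy sentences are all illustrative choices and are not taken from [PP98].

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2, top=5):
    """Weighted list of words found near `target`, scored by pointwise mutual information."""
    word_freq = Counter()
    pair_freq = Counter()
    total = 0
    for sent in sentences:
        for i, w in enumerate(sent):
            word_freq[w] += 1
            total += 1
            if w != target:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_freq[sent[j]] += 1
    scores = {}
    for w, n in pair_freq.items():
        # PMI = log2( P(target, w) / (P(target) * P(w)) ), estimated from raw counts
        scores[w] = math.log2((n * total) / (word_freq[target] * word_freq[w]))
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]


def translate_vector(vector, lexicon):
    """Translate each context word through a bilingual lexicon, keeping its weight."""
    return [(t, s) for w, s in vector for t in lexicon.get(w, [])]


# Toy source-language corpus, already tokenized; illustrative only.
sentences = [
    ["waste", "treatment", "plant"],
    ["water", "treatment", "plant"],
    ["salary", "increase"],
]
vector = context_vector("treatment", sentences, window=1)
```

The remaining step, comparing the translated vector with the context vectors of target-language words, requires a vector similarity measure, which this sketch leaves out.
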
3.2 Using Parallel Corpora

An alternative to using translation dictionaries is to use a parallel corpus, that
is, the same text written in different languages. If the corpus is large enough,
then simple statistical techniques[BCP+90, HdK97] can be used to produce
bilingual term equivalents by comparing which strings co-occur in the same
sentences over the whole corpus. But, as mentioned before, the Cross Language
Information Retrieval problem is slightly different from the Machine Translation
problem: one does not need to find the exact translation of a given word in a
given context; one is looking for documents about a given subject in a different
language.

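One very simple statistic of this kind scores a source word and a target word by how often they appear in the same aligned sentence pair. The sketch below uses the Dice coefficient, which is a stand-in chosen for brevity and not the technique of [BCP+90] or [HdK97]; the three aligned sentence pairs are invented.

```python
from collections import Counter
from itertools import product


def dice_table(aligned_pairs):
    """Score source/target word pairs by how often they share an aligned sentence pair."""
    src_freq, tgt_freq, both = Counter(), Counter(), Counter()
    for src, tgt in aligned_pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_freq.update(s_words)
        tgt_freq.update(t_words)
        both.update(product(s_words, t_words))
    # Dice coefficient: 2 * joint count / (source count + target count)
    return {(s, t): 2 * n / (src_freq[s] + tgt_freq[t]) for (s, t), n in both.items()}


def best_equivalent(word, table):
    """Return the highest-scoring target word for `word`, or None if unseen."""
    scored = [(score, t) for (s, t), score in table.items() if s == word]
    return max(scored)[1] if scored else None


# Toy sentence-aligned corpus; illustrative only.
pairs = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("red house", "maison rouge"),
]
table = dice_table(pairs)
```

On a corpus this small the scores are fragile; the point is only that co-occurrence over aligned sentences can surface term equivalents without a dictionary.
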
Work by Sheridan[SB96] illustrates this difference. Using a bilingual corpus of
newspaper articles, aligned on dates and language-independent subject codes, a
search was first done on German documents using a German query. The highest
ranking German documents responding to the query were extracted along with
the Italian documents from the same dates with the same subject codes. From
these Italian documents, the most frequently appearing words were extracted
to form an Italian query. In this way, German words are not translated directly,
but rather a pool of German words is made to correspond to a pool of Italian
words, with the exact relations between words in each language left unspecified.
This system, called SPIDER, now precalculates a similarity thesaurus over all
time-aligned documents. When a new query comes in, the most similar terms
in the target language are extracted directly from this thesaurus. Formulas for
creating this similarity thesaurus are given in [SB96].

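The original pooling procedure can be sketched as below. Several simplifications are assumptions of this sketch and not features of SPIDER: one text per (date, subject code) key, term-count ranking in place of a real retrieval engine, and no stopword removal. The similarity thesaurus itself is not reproduced here.

```python
from collections import Counter


def cross_language_pool(query_terms, german_docs, italian_docs, top_docs=2, top_words=3):
    """Both collections are dicts keyed by a shared (date, subject_code) alignment key."""

    def score(text):
        # Simple query-term count, standing in for a real IR engine's ranking.
        words = text.lower().split()
        return sum(words.count(t) for t in query_terms)

    ranked = sorted(german_docs, key=lambda k: score(german_docs[k]), reverse=True)
    keys = [k for k in ranked[:top_docs] if score(german_docs[k]) > 0]
    # Pool the most frequent words of the Italian documents aligned with the top hits.
    pool = Counter()
    for k in keys:
        pool.update(italian_docs.get(k, "").lower().split())
    return [w for w, _ in pool.most_common(top_words)]


# Toy aligned collections; dates, codes and texts are invented for illustration.
german = {
    ("1996-01-05", "POL"): "wahl regierung wahl",
    ("1996-01-06", "SPO"): "fussball spiel",
}
italian = {
    ("1996-01-05", "POL"): "elezioni governo elezioni",
    ("1996-01-06", "SPO"): "calcio partita",
}
italian_query = cross_language_pool(["wahl"], german, italian)
```

No German word is ever looked up in a dictionary: the Italian query is simply whatever is frequent in the documents aligned with the best German hits.
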
The idea behind this parallel corpora approach is that a group of words forms
some kind of point in an imaginary semantic space, and that articles in a
different language about the same subjects will use words from the same semantic
space defined in a different language. This idea of finding different words in a
nearby semantic space also underlies Latent Semantic Indexing methods.

In the Latent Semantic Indexing approach described by Littman[LDL98], and
used also by Oard[OD98], each word from both languages of a parallel corpus
becomes a row of a matrix whose columns correspond to document numbers.
Each entry (m,n) in the matrix corresponds to the number of times the
word m appears in the document number n. Documents which are translations
of each other have the same number, so that if a word a is always translated by
a word b in the corpus, the two rows corresponding to a and b will be exactly the
same. This matrix is usually very large and sparse, having one row per word in
the collections and one column per document. Latent Semantic Indexing uses a
matrix decomposition technique called singular-value decomposition[BDO+93]
which decomposes this very large matrix into three matrices, one of which has
non-zero values only on the diagonal. When this diagonal is sorted from largest to
smallest value, the small values and their corresponding rows and columns in
the other two matrices can be thrown away while still producing a matrix very
similar to the original matrix. This technique has been used to reduce
transmission rates of pictures from satellites. When applied to text as described
above, it reduces the number of dimensions in the semantic space, pushing similarly
used words closer to each other, even if the words are in different languages.
Each word is represented by a short vector of real numbers giving its position
in this reduced space. One can calculate distances between each vector to find
the most similar words to any one word. In this way, finding translations for a
given word reduces to finding target language words closest to a given source
language word using these distances.

Evans[EHM+98] also used such a semantic reduction technique, but instead
of reducing a word by document matrix, they reduced a symptom by disease
matrix.

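The decomposition described above can be written compactly. The notation is an assumption of this sketch: the symbols X, U, Σ, V and k do not appear in the chapter, and cosine is only one possible closeness measure, since the text speaks simply of distances between vectors.

```latex
% X: word-by-document matrix, X_{mn} = number of times word m appears in document n
X = U \Sigma V^{\mathsf{T}}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0

% keep only the k largest singular values and the matching columns of U and V
X_k = U_k \Sigma_k V_k^{\mathsf{T}} \approx X, \qquad k \ll r

% word m is represented by row m of U_k \Sigma_k, a vector of k real numbers
\hat{w}_m = (U_k \Sigma_k)_{m\cdot} = (X V_k)_{m\cdot} \in \mathbb{R}^{k}

% one possible closeness measure between a source word a and a target word b
\operatorname{sim}(a, b) =
  \frac{\hat{w}_a \cdot \hat{w}_b}{\lVert \hat{w}_a \rVert \, \lVert \hat{w}_b \rVert}
```

Because each reduced word vector is the corresponding row of X multiplied by V_k, two words with identical rows in X, such as a word a that is always translated by b, receive identical reduced vectors and a similarity of 1.
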
4 PRUNING TRANSLATION ALTERNATIVES

Once target language translation equivalents are found for the source language
words in the user query, the equivalents can be concatenated to form a new
target language query to submit to the underlying information retrieval engine.
One can consider this simple technique as letting the indexed text collection do
the filtering since any words not appearing in a document will not enter into
the results. Hull[Hul98] and Fluhr[FSO+98] use this simple corpus filtering
technique to eliminate some translation alternatives. In Fluhr's EMIR system,
a database of known compounds from the target language corpus is also used
to filter out translations from compound words translated word-to-word.

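Corpus filtering of this kind amounts to intersecting the translation candidates with the vocabulary of the indexed collection. The sketch below makes that explicit; the function names and the toy vocabulary are assumptions for illustration, and a real engine performs the same filtering implicitly at match time.

```python
def corpus_filter(translations, collection_vocabulary):
    """Keep only translation candidates that occur somewhere in the target collection."""
    return {
        term: [t for t in alts if t in collection_vocabulary]
        for term, alts in translations.items()
    }


def concatenate(translations):
    """Flatten all retained alternatives into one target-language query."""
    return [t for alts in translations.values() for t in alts]


# Toy data reusing the traitement example; the vocabulary is invented.
translations = {"traitement": ["salary", "treatment"], "déchets": ["waste"]}
vocabulary = {"waste", "treatment", "plant", "water"}
filtered = corpus_filter(translations, vocabulary)
target_query = concatenate(filtered)
```

In a collection about waste processing the sense salary never occurs, so it drops out without any explicit disambiguation step.
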
Slightly more proactively restrictive techniques are used by Ballesteros[BC98],
who only uses the first translation alternative given by their on-line
dictionary since in some dictionaries the first translation is the most common; by
Sheridan[SBS98], who only retains words in their Italian word pool, mentioned
in the last section, that also appear in a predefined general Italian wordlist;
and by Gachot[GLY98], who uses domain dependent machine translation with
a restricted dictionary to reduce translation alternatives; in their system the
user may specify semantic tags which eliminate ambiguities.

Yamabana[YMDK98] uses the target language corpus as a filter, but first
constructs all possible target language candidate noun phrases¹ by word-to-word
translation of the source language query terms. The possible noun phrases
are filtered by using the highest co-occurrence statistics among the candidate
terms. The results of this variant of translation-by-example[SN90] are presented
to the user. Interactive translation choice is possible by altering this candidate
or accessing an online dictionary.

¹ Ballesteros and Fluhr also experiment with using attested noun phrases in the
target language in order to filter translations.

Davis[Dav98] uses a more complex method for choosing among translation
alternatives. First the original English language query is run over the English
side of a parallel corpus. Then, in order to choose the best Spanish translation
of the English query terms, each Spanish alternative is run as a query over
the Spanish side of the corpus, and the alternative that produces the ranking
most like the English one is chosen as the translation equivalent. English terms
having no Spanish translation are run as fuzzy matches over the Spanish data,
so that cognates or near cognates can be recognized.

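The selection step can be illustrated with a toy parallel corpus in which translated documents share an identifier. The chapter does not say how two rankings are compared, so the set overlap used below is an assumption of this sketch, as are the term-count ranking and the example sentences; fuzzy matching of untranslated terms is left out.

```python
def ranking(term, docs):
    """Rank document ids by term frequency, dropping documents without the term."""
    scored = [(text.lower().split().count(term), doc_id) for doc_id, text in docs.items()]
    return [doc_id for n, doc_id in sorted(scored, reverse=True) if n > 0]


def overlap(rank_a, rank_b, depth=10):
    """Jaccard overlap of the top `depth` documents of two rankings (an assumed measure)."""
    a, b = set(rank_a[:depth]), set(rank_b[:depth])
    return len(a & b) / len(a | b) if a | b else 0.0


def choose_translation(term, alternatives, english_docs, spanish_docs):
    """Pick the alternative whose ranking on the Spanish side best matches the English one."""
    reference = ranking(term, english_docs)
    return max(alternatives, key=lambda alt: overlap(reference, ranking(alt, spanish_docs)))


# Toy parallel corpus: the same id denotes a document and its translation.
english = {
    1: "the bank raised interest rates",
    2: "the river bank flooded",
    3: "the bank approved the loan",
}
spanish = {
    1: "el banco subió los tipos de interés",
    2: "la orilla del río se inundó",
    3: "el banco aprobó el préstamo",
}
chosen = choose_translation("bank", ["banco", "orilla"], english, spanish)
```

Here banco retrieves two of the three documents that bank retrieves, while orilla retrieves only one, so banco is preferred for this corpus.
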
The problem of pruning translations can be seen as one of removing ambiguity
if one considers that different translations entail different nuances of meaning.
The research based upon LSI does not have this problem of pruning translation
alternatives, since the method maps words into a single point in semantic space.
The technique used by Littman, Oard and Evans supposes that a word
only has one global sense in the parallel corpus and that the sense's translations
in different languages map into the same point in the semantic space.

5 WEIGHTING TRANSLATION ALTERNATIVES

As mentioned previously, Cross Language Information Retrieval need not
resolve all the ambiguity among possible translations of a word. If the system
allows more than one possible translation, some alteration of the underlying
information retrieval system should be made to compensate for this situation.
In classical information retrieval, the number of times that a word appears in
the query influences the importance of that word when documents are ranked
against the query. In Cross Language Information Retrieval, if a word retains
many translations, the weight of that word would be artificially inflated if the
query is simply sent to a classic information retrieval engine.

Let's take an example. Suppose that the source language query is the Waldheim
affair. When translated into French, Waldheim remains Waldheim but
affair might be translated as aventure, business, affaire, case and liaison.
If the translated query is simply fed into the information retrieval engine as
Waldheim, aventure, business, affaire, case, liaison, then documents mentioning
some of the last five words might be ranked higher than those mentioning
Waldheim.

Many of the systems presented here do not treat this problem, hoping that
the different pruning strategies will mitigate the effect by limiting the number
of translations that get matched against the documents. Systems that perform
strict disambiguation do not need to worry about weighting, since one word in
the source language is translated by one word in the target language. However,
with corpus-based techniques like those creating a similarity thesaurus or using
LSI, there is no guarantee that all query concepts will be represented after
translation.

Hull[Hul98] directly addresses the problem and proposes a weighted boolean
scheme in which all the translation alternatives stemming from a given source
language query term are loosely OR-ed, so that their number does not influence
the ranked results any more than the original query term would influence the
ranking of source language documents.

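The effect of grouping alternatives by source term can be seen on the Waldheim example. The scoring below is a deliberately binary simplification written for this illustration; it is not Hull's weighted boolean formula, which is given in [Hul98], and the two toy documents are invented.

```python
# Translated query grouped by source term (the chapter's Waldheim example).
GROUPS = {
    "Waldheim": ["Waldheim"],
    "affair": ["aventure", "business", "affaire", "case", "liaison"],
}


def naive_score(doc_words, groups):
    """Every translation counts as its own query term."""
    words = set(doc_words)
    return sum(1 for alts in groups.values() for t in alts if t in words)


def grouped_score(doc_words, groups):
    """Each source term contributes at most once, however many alternatives match."""
    words = set(doc_words)
    return sum(1 for alts in groups.values() if any(t in words for t in alts))


doc_a = "Waldheim affaire".split()                # one translation of each source term
doc_b = "business case liaison aventure".split()  # many variants of one term only
```

Under the naive scheme doc_b outranks doc_a (4 against 2); once the alternatives for affair are treated as a single OR-ed group, doc_a scores 2 and doc_b only 1.
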
6 CONCLUSION

The field of Cross Language Information Retrieval brings together two distinct
lines of research, information retrieval and machine translation, sharing aspects
of both. But, just as at any other juncture of human effort, new, specific
problems arise. We have named three of the major problems of Cross Language
Information Retrieval: finding translations, pruning translations and weighting
translation alternatives; and described how the systems in this book deal
with them. There are, of course, many other problems dealing with multilingual
collections, such as properly treating multiple character sets, maintaining
variant collating orders, normalizing accentuation, separating languages within
a collection via language recognition[Gre95], language-specific stemming
routines or morphological analysis, visualizing text, glossing results, etc. Many of
these linguistic engineering and computational linguistic problems remain to
be solved.

Acknowledgements

I would like to thank David Hull of the Xerox Research Centre Europe for
comments which influenced the final revision of this chapter.